SOLR-342

Add support for Lucene's new Indexing and merge features (excluding Document/Field/Token reuse)

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: update
    • Labels: None

      Description

      LUCENE-843 adds support for new indexing capabilities via the setRAMBufferSizeMB() method, which should significantly speed up indexing for many applications. To take advantage of this, we will need the trunk version of Lucene (or wait for the next official release of Lucene).

      Side effect of this is that Lucene's new, faster StandardTokenizer will also be incorporated.

      Also need to think about how we want to incorporate the new merge scheduling functionality (the new default in Lucene is to do merges in a background thread).

      1. SOLR-342.patch
        22 kB
        Grant Ingersoll
      2. SOLR-342.patch
        22 kB
        Grant Ingersoll
      3. copyLucene.sh
        1 kB
        Grant Ingersoll
      4. SOLR-342.patch
        33 kB
        Grant Ingersoll
      5. SOLR-342.tar.gz
        781 kB
        Grant Ingersoll

        Issue Links

          Activity

          Grant Ingersoll added a comment -

          I think we should wait a bit more on this, as there still are a fair number of issues related to these changes in Lucene that need to be ironed out.

          Mike Klaas added a comment -

          We could probably just wait for lucene 2.3 to be released before releasing 1.3. I wouldn't be averse to pre-integrating the changes, though.

          Grant Ingersoll added a comment -

          Yep, I agree with pre-integrating; just from watching the Lucene discussions going on lately, I think it is worth letting a few more things be worked out before using a nightly build.

          Grant Ingersoll added a comment -

          Updated to cover the broader scope of changes that affect upgrading to Lucene trunk.

          Plan to implement:
          Add a <ramBufferSizeMB> tag to specify the amount of memory, in megabytes, to give Lucene's indexing buffer. Setting this value will override the other related IndexWriter settings (maxBufferedDocs, etc.)

          Add a <mergeScheduler> tag that can have two values: concurrent or serial. Or would it be better to take in a class name? Doing the latter would mean we would have to have a no-arg constructor, right?

          Add a <mergePolicy> tag that defines the merge policy and can have two values: byteSize or docCount. Or would it be better to take a class name?

          NOTE: I am not proposing to handle the new reusable Document/Field/Token mechanism in Lucene, which should also be considered.

          Yonik Seeley added a comment -

          Perhaps for now mergePolicy should be aligned with the buffering limit (ram or docs)?
          If/when we do add <mergePolicy> it seems like it should be a class name.

          Grant Ingersoll added a comment -

          OK, this would imply it is set to LogByteSizeMergePolicy when setting <ramBufferSizeMB> and LogDocMergePolicy when setting <maxBufferedDocs>

          Grant Ingersoll added a comment -

          First crack at implementing this. All tests pass on OS X except SolrJ's SolrExceptionTest, but for some reason that is failing on a clean version, too, so I am convinced it is not due to anything I did.

          My personal benchmarking of just the Lucene side of things (see indexing.alg in Lucene contrib/benchmark) shows pretty significant performance gains. This is also anecdotally confirmed by my basic testing in Solr.

          I set the default to 16MB, per Mike McCandless's defaults in Lucene, but this is probably too low given the server nature of Solr, where a lot more memory is likely to be available.

          There are 4 new configuration possibilities:
          <ramBufferSizeMB> - When set, <maxBufferedDocs> is set to DISABLE_AUTO_FLUSH. Default is the maxBufferedDocs way, but this could be changed to be the other way around (and probably should be)
          <mergePolicy> - Set the MergePolicy; the default is Lucene's new LogByteSizeMergePolicy. The old Lucene policy is LogDocMergePolicy.
          <mergeScheduler> - Set the way merges are performed. The new default is ConcurrentMergeScheduler, which runs merges in separate background threads; the old way was SerialMergeScheduler.
          <luceneAutoCommit> - Specify whether Lucene's IndexWriter should autoCommit flushes; false is best for performance and is the default here. Still need to develop recommendations for when to change this. Named it this way to avoid confusion with Solr's autoCommit.

          Patch is inside the tar file, as well as a bundling of the Lucene jars (not technically the latest, but only a couple days old)
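          Pulling the four settings above together, a hypothetical solrconfig.xml fragment might look like the following. The tag names are the ones proposed in this patch; the values, the class-name form for <mergePolicy>/<mergeScheduler> (class names vs. keywords is still being discussed above), and the placement inside <indexDefaults> are illustrative assumptions, not recommendations:

          ```xml
          <indexDefaults>
            <!-- flush by RAM usage; setting this disables flushing by document count -->
            <ramBufferSizeMB>16</ramBufferSizeMB>
            <!-- byte-size-based merging (new Lucene default); LogDocMergePolicy is the old behavior -->
            <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
            <!-- run merges in background threads; SerialMergeScheduler is the old behavior -->
            <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
            <!-- let Solr, not Lucene's IndexWriter, control commits -->
            <luceneAutoCommit>false</luceneAutoCommit>
          </indexDefaults>
          ```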

          Yonik Seeley added a comment -

          > Default is the maxBufferedDocs way, but this could be changed to be the other way around (and probably should be)

          +1
          ramBufferSizeMB is now the default in Lucene AFAIK.
          I think perhaps 32MB might be a good default.

          > luceneAutoCommit [...] Still need to develop recommendations for when to change this.

          For Solr, you would never want to use it. Trying to catch a glimpse of new segments as they are flushed leads to an inconsistent view of the index, since docs haven't been deleted yet.

          We do need to document recommended solrconfig.xml changes in CHANGES.txt (at the top in the migration section we normally have) for people to get these performance gains with existing configs.

          Grant Ingersoll added a comment -

          > +1
          > ramBufferSizeMB is now the default in Lucene AFAIK.
          > I think perhaps 32MB might be a good default.

          The other thing is, Lucene actually supports setting both, and flushes based on whichever one is hit first. Is this worth supporting?

          > For Solr, you would never want to use it. Trying to catch a glimpse of new segments as they are flushed leads to an inconsistent view of the index, since docs haven't been deleted yet.

          Should we even expose this, then? It seems like we should just make it false.

          I will write up more in the changes.

          Yonik Seeley added a comment -

          > The other thing is, Lucene actually supports setting both, and flushes based on whichever one is hit first. Is this worth supporting?

          Since it takes no extra work in Solr, I guess we should just allow it.

          > Should we even expose this [lucene autocommit], then? It seems like we should just make it false.

          I think so... some people use Solr in some quite advanced ways.

          Grant Ingersoll added a comment -

          Changes:

          1. Updated changes.txt with recommendations on settings.

          2. Changed SolrIndexWriter from last patch to allow for setting both maxBufferedDocs and ramBufferSizeMB.

          3. Updated the various sample solrconfig.xml to have a default of 32 MB for ramBufferSizeMB. Commented out maxBufferedDocs, but did not deprecate it.

          4. Added a note to the various solrconfig.xml files explaining what Lucene does if BOTH ramBufferSizeMB and maxBufferedDocs are set.

          The Lucene libraries are bundled with the previous patch, but are still needed.

          Will Johnson added a comment -

          Just a comment to say that we added this patch and saw rather significant improvements, on the order of 10-25% for different index tests.

          Grant Ingersoll added a comment -

          Dumb little script to copy over the required Lucene jars from a built Lucene directory. Takes two parameters: the location of the Lucene home directory and the version to copy over. Requires Lucene to be built.

          Belongs in the lib directory.

          For example,
          ./copyLucene.sh <path to Lucene> 2.3-dev
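          For anyone recreating the helper rather than downloading the attachment, here is a rough sketch of what such a script might do. The jar names and build/ paths are assumptions about Lucene's ant build layout, not taken from the actual attachment, and a third destination argument is added here (the real script runs from Solr's lib directory):

          ```shell
          # Hypothetical sketch of a copyLucene.sh-style helper: copy built Lucene
          # jars from a Lucene checkout into a destination directory (e.g. Solr's lib/).
          copy_lucene() {
              lucene_home="$1"   # path to the Lucene checkout (must already be built)
              version="$2"       # e.g. 2.3-dev
              dest="$3"          # destination directory, e.g. Solr's lib/

              for jar in "$lucene_home/build/lucene-core-$version.jar" \
                         "$lucene_home/build/contrib/analyzers/lucene-analyzers-$version.jar"; do
                  # fail loudly if Lucene hasn't been built yet
                  [ -f "$jar" ] || { echo "missing $jar (run ant in $lucene_home first)" >&2; return 1; }
                  cp "$jar" "$dest/"
              done
          }
          ```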

          Will Johnson added a comment -

          Is there any update on getting this patch committed? We needed to be able to set some of the buffer sizes, so the script wouldn't help. Have other people experienced troubles with 2.3 and/or this patch that I should be wary of?

          Grant Ingersoll added a comment -

          In my mind, there are still some issues w/ 2.3 dev that are being worked on. Personally, I think we should wait until 2.3 is released, but it would be good for people to get some running-time with this patch, if they can, before then, as that will help work out any issues remaining in 2.3.

          Grant Ingersoll added a comment -

          Looks like Lucene 2.3 is shaping up to be released fairly soon (~2 weeks) and that many of the indexing/thread-safety concerns have been worked out. Might as well wait for the official release at this point, although I have been using 2.3-dev for a fairly long time at this point.

          Yonik Seeley added a comment -

          I don't think it's necessary to wait for the official lucene 2.3 release, esp since there is still a lot of solr work to be done (tokenizer upgrades to use char[], reusable tokenizers, reusable analyzers?, etc). We could upgrade to the latest snapshot when someone is willing to tackle those issues.

          Grant Ingersoll added a comment -

          Updated to work against trunk.

          As always, let me know if there is anything I can do to help get this committed.

          Koji Sekiguchi added a comment -

          Now mergeFactor can be effective as long as mergePolicy is an instance of LogMergePolicy.
          How about

          <mergePolicy mergeFactor="10"/>

          or

          <mergePolicy>
            <mergeFactor>10</mergeFactor>
          </mergePolicy>

          instead of

          <indexDefaults>
            <mergeFactor>10</mergeFactor>
          </indexDefaults>

          Grant Ingersoll added a comment -

          I did some benchmarking of the autocommit functionality in Lucene (as opposed to in Solr, which is different). Currently, in Lucene autocommit is true by default, meaning that every time there is a flush, it is also committed. Solr adds its own layer on top of this with its commit semantics. There is a noticeable difference in memory used and speed in Lucene performance between autocommit = false and autocommit = true.

          Some rough numbers using the autocommit.alg in Lucene's benchmark contrib (from trunk):
          Operation        round  ac     ram   runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
          MAddDocs_200000      0  true   2.00       1      200000  400.1      499.90  61,322,608   68,780,032
          MAddDocs_200000      1  false  2.00       1      200000  499.9      400.08  49,373,632   75,018,240
          MAddDocs_200000      2  true   2.00       1      200000  383.7      521.27  70,716,096   75,018,240
          MAddDocs_200000      3  false  2.00       1      200000  552.7      361.89  68,069,464   75,018,240

          The first row has autocommit = true, second is false, and then alternating. The key value is the rec/s, which is:
          1. ac = true 400.1
          2. ac = false 499.9
          3. ac = true 383.7
          4. ac = false 552.7

          Notice also the diff in avgUsedMem. Adding this functionality may, perhaps, be more important to Solr's performance than the flush by RAM capability.

          Grant Ingersoll added a comment -

          Update of patch to account for the fact that mergeFactor only applies to log-based merge policies. I left it as the <mergeFactor> tag, but put an instanceof check in the init method of SolrIndexWriter to see if the mergeFactor is settable.

          Will Johnson added a comment -

          I think we're running into a very serious issue with trunk + this patch: either the document summaries are not matched or the overall matching is 'wrong'. I did find this in the Lucene JIRA: LUCENE-994

          "Note that these changes will break users of ParallelReader because the
          parallel indices will no longer have matching docIDs. Such users need
          to switch IndexWriter back to flushing by doc count, and switch the
          MergePolicy back to LogDocMergePolicy. It's likely also necessary to
          switch the MergeScheduler back to SerialMergeScheduler to ensure
          deterministic docID assignment."

          We're seeing rather consistent bad results, but only after 20-30k documents and multiple commits, and are wondering if anyone else is seeing anything. I've verified that the results are bad even through Luke, which would seem to remove the search side of the Solr equation. The basic test case is to search for title:foo and get back documents that only have title:bar. We're going to start on a unit test, but given the document counts and the corpus we're testing against it may be a while, so I thought I'd ask to see if anyone had any hints.

          Removing this patch seems to remove the issue, so it doesn't appear to be a Lucene problem.

          Yonik Seeley added a comment -

          Yikes! Thanks for the report Will. It certainly sounds like a Lucene issue to me (esp because removal of this patch fixes things... that means it only happens under certain lucene settings). Could you perhaps try the very latest Lucene trunk (there were some seemingly unrelated fixes recently).

          Yonik Seeley added a comment -

          Will, are you using term vectors anywhere, or any customizations to Solr (at the Lucene level)?
          When you say "document summaries are not matched", do you mean that the incorrect documents are matched, or that the correct documents are matched but just the highlighting is wrong?

          Will Johnson added a comment -

          Patched Solr + Lucene trunk is still broken. If anyone has any pointers for ways to coax this problem to happen before we get 20-30k large docs in the system, let me know and we can start working on a unit test; otherwise it's going to take a while to reproduce anything.

          Yonik Seeley added a comment -

          Thanks Will. My guess at this point is a merging bug in Lucene, so you might be able to reproduce by forcing more merges: set mergeFactor to 2 and lower how many docs it takes to do a merge (set maxBufferedDocs to 2, or set ramBufferSizeMB to 1).
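          A minimal set of aggressive-merge settings along those lines (illustrative values only, in the solrconfig.xml tags this patch introduces) could be:

          ```xml
          <indexDefaults>
            <!-- merge as often as possible so a merge bug surfaces on a small index -->
            <mergeFactor>2</mergeFactor>
            <maxBufferedDocs>2</maxBufferedDocs>
            <!-- alternatively, flush by RAM instead: <ramBufferSizeMB>1</ramBufferSizeMB> -->
          </indexDefaults>
          ```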

          Grant Ingersoll added a comment -

          Can you share your settings? (solrconfig.xml), or at least the relevant sections.

          Will Johnson added a comment -

          We have:

          <mergeFactor>10</mergeFactor>
          <ramBufferSizeMB>64</ramBufferSizeMB>
          <maxMergeDocs>2147483647</maxMergeDocs>

          And I'm working on a unit test, but just adding a few terms per doc doesn't seem to trigger it, at least not 'quickly.'

          Grant Ingersoll added a comment -

          You mentioned ParallelReader; are you using that, or any other patches?

          > problem to happen before we get 20-30k large docs

          What is "large" in your terms?

          Will Johnson added a comment -

          We're not using ParallelReader, but we are using direct core access instead of going over HTTP. As for doc size, we're indexing Wikipedia but creating a number of extra fields. They are only large in comparison to any of the 'large volume' tests I've seen in most of the Solr and Lucene tests.

          - Will
          Grant Ingersoll added a comment -

          Direct core meaning embedded, right? It's interesting, b/c I have done a fair amount of Lucene 2.3 testing w/ Wikipedia (nothing like a free, fairly large dataset).

          Can you reproduce the problem using Lucene directly? (have a look at contrib/benchmark for a way to get Lucene/Wikipedia up and running quickly)

          Also, are there any associated exceptions anywhere in the chain? Or is it just that your index is bad? Are you starting from a clean index or updating an existing one?

          Will Johnson added a comment -

          We're using SolrCore directly, along these lines:

          core = new SolrCore("foo", dataDir, solrConfig, solrSchema);
          UpdateHandler handler = core.getUpdateHandler();
          handler.addDoc(command);

          which is a bit more low-level than normal; however, when we flipped back to Solr trunk + Lucene 2.3 everything was fine, so it leads me to believe that we are OK in that respect.

          I was going to try to reproduce with Lucene directly as well, but that too is a bit outside the scope of what I have time for at the moment.

          And we're not getting any exceptions, just bad search results.

          Grant Ingersoll added a comment -

          Also, are you doing multi-threaded indexing or are you indexing while searching?

          Will Johnson added a comment -

          We are doing multi-threaded indexing, and searching while indexing; however, the 'bad' results come back after the indexing run is finished and the index itself is static.

          Yonik Seeley added a comment -

          OK, I've managed to reproduce this in a straight lucene testcase.
          I'm doing further verification and will open up a Lucene bug shortly.

          Grant Ingersoll added a comment -

          Sounds like we will have a Lucene 2.3.1 release in the next week or so with the fixes in place. Will, in the meantime, when LUCENE-1177 and LUCENE-1173 get committed, maybe you could try either building the Lucene 2.3 branch or trying out the Lucene nightly build to see if it solves your issue, and let us know.

          Will Johnson added a comment -

          I switched to the Lucene 2.3 branch, updated (and confirmed that Yonik's one-line change was in place), reran the tests, and still saw the same problem. If I missed something, please let me know.

          Michael McCandless added a comment -

          Will, did you create a new index in your test?

          Also make sure you are using this URL to checkout the 2.3 branch sources:

          https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3

          You should see 7 issues listed in CHANGES.txt under "Bug fixes".
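
          For reference, checking out that branch is a single svn command (a sketch; assumes an svn client is installed, and the local directory name is arbitrary):

          ```shell
          # Check out the Lucene 2.3 bugfix branch (not the 2.3.0 release tag)
          svn checkout https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3 lucene_2_3
          ```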

          Michael McCandless added a comment -

          Will, one more thing to try is to turn on assertions for org.apache.lucene.*; this may catch an issue sooner.
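
          Assertions are enabled per package tree with the JVM's -ea flag; a launch might look like this (a sketch; start.jar is an assumption based on the Solr example Jetty setup):

          ```shell
          # Enable Java assertions for org.apache.lucene and all its subpackages
          java -ea:org.apache.lucene... -jar start.jar
          ```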

          Yonik Seeley added a comment -

          Will, you should be able to verify the lucene version with this link:
          http://localhost:8983/solr/admin/registry.jsp
          It should be different from this:
          Lucene Specification Version: 2.3.0
          Lucene Implementation Version: 2.3.0 613715 - buschmi - 2008-01-21 01:30:48
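
          A quick scripted version of that check (a sketch; assumes Solr is running locally on the default example port and that curl and grep are available):

          ```shell
          # Fetch the Solr registry page and print the Lucene version lines
          curl -s http://localhost:8983/solr/admin/registry.jsp | grep "Lucene"
          ```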

          Will Johnson added a comment -

          The new Solr with the new Lucene did the trick. I made the mistake of using the 2.3 tag instead of the branch before, which was why I still saw the problem.

          Michael McCandless added a comment -

          Super, I'm glad to hear that!

          Grant Ingersoll added a comment -

          I'm going to upgrade to 2.3.1 and then double check this and commit, unless I hear any objections in the next day or two.

          Grant Ingersoll added a comment -

          Oops, we are already on 2.3.1, so I will just commit in a day or two.

          Grant Ingersoll added a comment -

          Committed revision 634016.


            People

            • Assignee: Grant Ingersoll
            • Reporter: Grant Ingersoll
            • Votes: 1
            • Watchers: 1
