Lucene - Core
  1. Lucene - Core
  2. LUCENE-2294

Create IndexWriterConfiguration and store all of IW configuration there

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I would like to factor out of all IW configuration parameters into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration). What I was thinking of storing there are the following parameters:

      • All of ctors parameters, except for Directory.
      • The different setters where it makes sense. For example I still think infoStream should be set on IW directly.

      I'm thinking that IWC should expose everything in a setter/getter methods, and defaults to whatever IW defaults today. Except for Analyzer which will need to be defined in the ctor of IWC and won't have a setter.

      I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 10000 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ...

      I plan to deprecate all the ctors and getters/setters and replace them by:

      • One ctor as described above
      • getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest.
      • About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig);
        • The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2());

      BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API.

      I'll start to work on a patch.

      1. LUCENE-2294.patch
        569 kB
        Michael McCandless
      2. LUCENE-2294.patch
        570 kB
        Shai Erera
      3. check.py
        1 kB
        Michael McCandless
      4. LUCENE-2294.patch
        572 kB
        Shai Erera
      5. LUCENE-2294.patch
        556 kB
        Shai Erera
      6. LUCENE-2294.patch
        62 kB
        Shai Erera
      7. LUCENE-2294.patch
        60 kB
        Shai Erera

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          Take 2! Thanks Shai.

          Show
          Michael McCandless added a comment - Take 2! Thanks Shai.
          Hide
          Shai Erera added a comment -

          Thanks a lot Robert for reviewing this. No harm done ... I've had the chance to exercise some of Eclipse tricks in the process or re-doing. Unfortunately it introduced some changes, but luckily we have Mike and his mighty-python-scripting-ability to protect us .

          Now I have a bunch of other issues I need to open, that were waiting for this guy to go in. Stay tuned

          Show
          Shai Erera added a comment - Thanks a lot Robert for reviewing this. No harm done ... I've had the chance to exercise some of Eclipse tricks in the process or re-doing. Unfortunately it introduced some changes, but luckily we have Mike and his mighty-python-scripting-ability to protect us . Now I have a bunch of other issues I need to open, that were waiting for this guy to go in. Stay tuned
          Hide
          Robert Muir added a comment -

          I can't wait for this to be in ... an exhausting issue

          Shai, thanks for taking the time to redo this massive patch. I'm sorry
          again I dropped the ball and didn't notice till the commit, forcing you
          to redo a lot of work.

          +1

          Show
          Robert Muir added a comment - I can't wait for this to be in ... an exhausting issue Shai, thanks for taking the time to redo this massive patch. I'm sorry again I dropped the ball and didn't notice till the commit, forcing you to redo a lot of work. +1
          Hide
          Shai Erera added a comment -

          Thanks Mike. I ran the tool once, fix all that it complained. Then 2nd time it found some more (probably some I missed in the 1st pass), only this time really few more. So I fixed them as well. But I didn't run it 3rd time ...

          I can't wait for this to be in ... an exhausting issue .

          Show
          Shai Erera added a comment - Thanks Mike. I ran the tool once, fix all that it complained. Then 2nd time it found some more (probably some I missed in the 1st pass), only this time really few more. So I fixed them as well. But I didn't run it 3rd time ... I can't wait for this to be in ... an exhausting issue .
          Hide
          Michael McCandless added a comment -

          Attached new patch, just fixing a couple tests where analyzer had changed.

          I it's ready to commit (take 2)! I'll wait a day or two...

          Show
          Michael McCandless added a comment - Attached new patch, just fixing a couple tests where analyzer had changed. I it's ready to commit (take 2)! I'll wait a day or two...
          Hide
          Michael McCandless added a comment -

          Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them?

          Hmm some of these (at least TestAtomicUpdate was changed from Simple -> Whitespace) were in fact real changes.... I'll fix & post a new patch.

          Show
          Michael McCandless added a comment - Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them? Hmm some of these (at least TestAtomicUpdate was changed from Simple -> Whitespace) were in fact real changes.... I'll fix & post a new patch.
          Hide
          Michael McCandless added a comment -

          Thanks Shai, I'll look...

          Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them?

          Yes if they are indeed false positives...

          Show
          Michael McCandless added a comment - Thanks Shai, I'll look... Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them? Yes if they are indeed false positives...
          Hide
          Shai Erera added a comment -

          Patch updated w/:

          1. Move setMergedSegmentWarmer to IWC.
          2. Revert the changes reported by check.py

          Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them?

          Show
          Shai Erera added a comment - Patch updated w/: Move setMergedSegmentWarmer to IWC. Revert the changes reported by check.py Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them?
          Hide
          Shai Erera added a comment -

          Should we also move MergedSegmentWarmer to IWC?

          I guess it could ... wonder how I missed it ...

          This adds more Version.LUCENE_CURRENTs in contrib...

          I added it on purpose, so we know to convert these places when we remove LUCENE_CURRENT. We can also do this gradually in LUCENE-2305.

          There were some tests that previously passed null as analyzers

          Those were exactly the cases I wanted to move Analyzer out of the mandatory list. Whatever you pass will work for those tests, because they obviously don't rely on analysis ... otherwise they'd had been hit by NPE.
          Hmm ... I wonder if I can still pass 'null' for them. I used eclipse search/replace tool to add the Analyzer to IWC calls. The null argument disappeared completely because they used the default, so the search/replace tool replaced that w/ Whitespace. I prefer to keep it that way, to avoid re-changing them again.

          There are a number of tests where the analyzers were changed ...

          I thought that that's ok, because those tests don't rely on the output of analysis, or otherwise they should have tested/asserted it. Since they don't, it shouldn't matter to them. But I'll revert that change.

          So two things I need to do:

          1. revert the changes to those tests
          2. add warmer to
          Show
          Shai Erera added a comment - Should we also move MergedSegmentWarmer to IWC? I guess it could ... wonder how I missed it ... This adds more Version.LUCENE_CURRENTs in contrib... I added it on purpose, so we know to convert these places when we remove LUCENE_CURRENT. We can also do this gradually in LUCENE-2305 . There were some tests that previously passed null as analyzers Those were exactly the cases I wanted to move Analyzer out of the mandatory list. Whatever you pass will work for those tests, because they obviously don't rely on analysis ... otherwise they'd had been hit by NPE. Hmm ... I wonder if I can still pass 'null' for them. I used eclipse search/replace tool to add the Analyzer to IWC calls. The null argument disappeared completely because they used the default, so the search/replace tool replaced that w/ Whitespace. I prefer to keep it that way, to avoid re-changing them again. There are a number of tests where the analyzers were changed ... I thought that that's ok, because those tests don't rely on the output of analysis, or otherwise they should have tested/asserted it. Since they don't, it shouldn't matter to them. But I'll revert that change. So two things I need to do: revert the changes to those tests add warmer to
          Hide
          Michael McCandless added a comment -

          Patch looks good Shai – that was fast adding back in the analyzers

          A few things:

          • Should we also move MergedSegmentWarmer to IWC?
          • This adds more Version.LUCENE_CURRENTs in contrib... but I don't
            see any way around that (and it's good to add the placeholder to
            IWC).
          • There were some tests that previously passed null as analyzers,
            that you changed to WhitespaceAnalyzer, which seems OK.
          • I see you cleaned some stuff up in the process (fixed "extends
            TestCase" -> extends "LuceneTestCase", use TEST_VERSION_CURRENT
            consistently instead using class member, added Version to ctor for
            analyzers that now require them, noting that test is deprecated
            since it only tests deprecated class, fixed places where tests
            passed null as analyzer) – good!
          • There are a number of tests where the analyzers were changed, eg
            Simple -> Whitespace or Standard -> Whitespace – I attached
            check.py (run it w/ 1 arg which is path to the patch file). Note
            that not all things it finds are actually changed (ie it has some
            false pos's, but I think not too many). Tests still pass... but
            it makes me a bit nervous. Shouldn't we just keep the same
            analyzer?
          Show
          Michael McCandless added a comment - Patch looks good Shai – that was fast adding back in the analyzers A few things: Should we also move MergedSegmentWarmer to IWC? This adds more Version.LUCENE_CURRENTs in contrib... but I don't see any way around that (and it's good to add the placeholder to IWC). There were some tests that previously passed null as analyzers, that you changed to WhitespaceAnalyzer, which seems OK. I see you cleaned some stuff up in the process (fixed "extends TestCase" -> extends "LuceneTestCase", use TEST_VERSION_CURRENT consistently instead using class member, added Version to ctor for analyzers that now require them, noting that test is deprecated since it only tests deprecated class, fixed places where tests passed null as analyzer) – good! There are a number of tests where the analyzers were changed, eg Simple -> Whitespace or Standard -> Whitespace – I attached check.py (run it w/ 1 arg which is path to the patch file). Note that not all things it finds are actually changed (ie it has some false pos's, but I think not too many). Tests still pass... but it makes me a bit nervous. Shouldn't we just keep the same analyzer?
          Hide
          Michael McCandless added a comment -

          Thanks Shai! I'll look through it...

          Show
          Michael McCandless added a comment - Thanks Shai! I'll look through it...
          Hide
          Shai Erera added a comment -

          Patch w/ IWC requiring an Analyzer in its ctor. All tests pass

          Show
          Shai Erera added a comment - Patch w/ IWC requiring an Analyzer in its ctor. All tests pass
          Hide
          Shai Erera added a comment -

          Well it'd simplify your work here, for one

          I'm afraid that's too late no? I've already completed that work ... it's that late realization that forces me to re-do it again ...

          I don't mind if IW does not know about Analyzers, I'm perfectly fine w/ it (and even like it).

          Suggestion: since re-doing all that work, meaning adding Whitespace to the instantiation in all the places that really don't care, is very expensive and time consuming on my part, if LUCENE-2309 is resolved before 3.1, can we keep IWC as-is, and as part of 2309 remove setAnalyzer from IWC? It will keep the ctor's signature as it is now, and we won't need to change all of these classes again in 2309. Only those that explicitly call setAnalyzer.

          Is that acceptable?

          Show
          Shai Erera added a comment - Well it'd simplify your work here, for one I'm afraid that's too late no? I've already completed that work ... it's that late realization that forces me to re-do it again ... I don't mind if IW does not know about Analyzers, I'm perfectly fine w/ it (and even like it). Suggestion: since re-doing all that work, meaning adding Whitespace to the instantiation in all the places that really don't care, is very expensive and time consuming on my part, if LUCENE-2309 is resolved before 3.1, can we keep IWC as-is, and as part of 2309 remove setAnalyzer from IWC? It will keep the ctor's signature as it is now, and we won't need to change all of these classes again in 2309. Only those that explicitly call setAnalyzer. Is that acceptable?
          Hide
          Michael McCandless added a comment -

          Isn't that like calling IW.addDocument(Doc, Analyzer) where some of the fields have pre-defined TokenStream?

          Well, where all of the fields have pre-defined TokenStream, or, can produce it on demand with no other args (ie that field instance knows its analyzer).

          What's the difference then? Why change the current API (besides getting rid of the addDoc(Doc) method)?

          Well it'd simplify your work here, for one

          But it'd also put distance between IW and analysis, which I think is useful for others to use Lucene with their own "means" of producing tokens. Tying IW to less concrete impls also increases our independence which makes cross-version compatibility easier, over time. The more independent the various parts of Lucene are the more we can factor things out...

          Show
          Michael McCandless added a comment - Isn't that like calling IW.addDocument(Doc, Analyzer) where some of the fields have pre-defined TokenStream? Well, where all of the fields have pre-defined TokenStream, or, can produce it on demand with no other args (ie that field instance knows its analyzer). What's the difference then? Why change the current API (besides getting rid of the addDoc(Doc) method)? Well it'd simplify your work here, for one But it'd also put distance between IW and analysis, which I think is useful for others to use Lucene with their own "means" of producing tokens. Tying IW to less concrete impls also increases our independence which makes cross-version compatibility easier, over time. The more independent the various parts of Lucene are the more we can factor things out...
          Hide
          Shai Erera added a comment -

          I'm assuming you would set an Analyzer for the document - and then you could override per field - or something along those lines.

          Isn't that like calling IW.addDocument(Doc, Analyzer) where some of the fields have pre-defined TokenStream? What's the difference then? Why change the current API (besides getting rid of the addDoc(Doc) method)?

          Show
          Shai Erera added a comment - I'm assuming you would set an Analyzer for the document - and then you could override per field - or something along those lines. Isn't that like calling IW.addDocument(Doc, Analyzer) where some of the fields have pre-defined TokenStream? What's the difference then? Why change the current API (besides getting rid of the addDoc(Doc) method)?
          Hide
          Michael McCandless added a comment -

          If we did LUCENE-2309 then IWC wouldn't need an Analyzer, default or otherwise

          Show
          Michael McCandless added a comment - If we did LUCENE-2309 then IWC wouldn't need an Analyzer, default or otherwise
          Hide
          Michael McCandless added a comment -

          I opened LUCENE-2309 for decoupling IW from analysis. IW would only need to know about attr source.

          Show
          Michael McCandless added a comment - I opened LUCENE-2309 for decoupling IW from analysis. IW would only need to know about attr source.
          Hide
          Mark Miller added a comment -

          I'm assuming you would set an Analyzer for the document - and then you could override per field - or something along those lines.

          Show
          Mark Miller added a comment - I'm assuming you would set an Analyzer for the document - and then you could override per field - or something along those lines.
          Hide
          Shai Erera added a comment -

          Mike - regarding getting rid of Analyzer on IW. Today it's very convenient that I can specify my default Analyzer for all ANALYZED fields, for all documents. I don't understand how a Field can introduce an Analyzer - do I now set an Analyzer on every field I add to a Document? Wouldn't that be extremely inconvenient?

          If you want to get rid of Analyzer on IW I'm fine. We can throw away the addDocument(Doc) method and always force a user to pass an Analyzer. Much better than requiring him to pass it on every field he adds to a document ...

          Show
          Shai Erera added a comment - Mike - regarding getting rid of Analyzer on IW. Today it's very convenient that I can specify my default Analyzer for all ANALYZED fields, for all documents. I don't understand how a Field can introduce an Analyzer - do I now set an Analyzer on every field I add to a Document? Wouldn't that be extremely inconvenient? If you want to get rid of Analyzer on IW I'm fine. We can throw away the addDocument(Doc) method and always force a user to pass an Analyzer. Much better than requiring him to pass it on every field he adds to a document ...
          Hide
          Shai Erera added a comment -

          what makes it different is that breaking text into tokens and indexing them is what search is all about.

          I agree, and Whitespace is easy to explain and understand. Who said that Standard or Simple is better, for example (I know you don't say it)? I just don't understand why we should force everyone to select an Analyzer even when they don't want to, or care.

          Remember that all the people out there today already set an Analyzer, so when they port to the new API they'll continue setting it (unless they didn't care but were forced to do it). New users ... well I hope they read jdocs.

          It's good that SOLR doesn't force people to specify the Analyzer ... smart decision, for play/example purposes.

          Show
          Shai Erera added a comment - what makes it different is that breaking text into tokens and indexing them is what search is all about. I agree, and Whitespace is easy to explain and understand. Who said that Standard or Simple is better, for example (I know you don't say it)? I just don't understand why we should force everyone to select an Analyzer even when they don't want to, or care. Remember that all the people out there today already set an Analyzer, so when they port to the new API they'll continue setting it (unless they didn't care but were forced to do it). New users ... well I hope they read jdocs. It's good that SOLR doesn't force people to specify the Analyzer ... smart decision, for play/example purposes.
          Hide
          Michael McCandless added a comment -

          Really, eventually, I'd like a stronger separation of analysis and
          indexing, so that IndexWriter never even gets an analyzer.

          Ie, the fields in the doc that need to be indexed should only present
          an attr source, and IW pulls from that.

          IW is then fully agnostic to how that attr source was created – one
          need not even use analyzers/tokenizer/tokenfilter approach
          at all (create something custom).

          IW now has a lot of embedded logic to figure out how to get the attr
          source – not analyzed fields, analyzed but it's a Reader vs a String
          vs pre-analyzed, etc.

          But until then I agree it's dangerous to default the analyzer – user
          should be explicit. It's similar to how we promoted field truncation
          (max field length) to be an explicit choice, because it's a dangerous
          default (because it silently alters what makes it into your index).

          Show
          Michael McCandless added a comment - Really, eventually, I'd like a stronger separation of analysis and indexing, so that IndexWriter never even gets an analyzer. Ie, the fields in the doc that need to be indexed should only present an attr source, and IW pulls from that. IW is then fully agnostic to how that attr source was created – one need not even use analyzers/tokenizer/tokenfilter approach at all (create something custom). IW now has a lot of embedded logic to figure out how to get the attr source – not analyzed fields, analyzed but it's a Reader vs a String vs pre-analyzed, etc. But until then I agree it's dangerous to default the analyzer – user should be explicit. It's similar to how we promoted field truncation (max field length) to be an explicit choice, because it's a dangerous default (because it silently alters what makes it into your index).
          Hide
          Mark Miller added a comment -

          Question - does SOLR requires everyone to specify an Analyzer, or does it come w/ a default one?

          Hmm... SOLR doesn't really use Lucene analyzers.

          It comes with a default Schema.xml that defines FieldTypes. Then field names can be assigned to FieldTypes. So technically speaking, no, Solr does not - but because most
          people build off the example, you could say that it does have defaults for example FieldTypes and defaults of what field names map to those. But it also only accepts certain example fields with the example Schema - you really
          have to go in and customize it to your needs - its setup to basically show off what options are available and work with some demo stuff.

          Solr comes with almost no defaults in a way - but it does ship with an example setup that is meant to show you how to set things up, and what is available. You could consider those defaults since most will build off it.

          example of Solr analyzer declaration:

              <!-- A general unstemmed text field - good if one does not know the language of the field -->
              <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
                <analyzer type="index">
                  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
                  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
                  <filter class="solr.LowerCaseFilterFactory"/>
                </analyzer>
                <analyzer type="query">
                  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                  <filter class="solr.StopFilterFactory"
                          ignoreCase="true"
                          words="stopwords.txt"
                          enablePositionIncrements="true"
                          />
                  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
                  <filter class="solr.LowerCaseFilterFactory"/>
                </analyzer>
              </fieldType>
          
          Show
          Mark Miller added a comment - Question - does SOLR requires everyone to specify an Analyzer, or does it come w/ a default one? Hmm... SOLR doesn't really use Lucene analyzers. It comes with a default Schema.xml that defines FieldTypes. Then field names can be assigned to FieldTypes. So technically speaking, no, Solr does not - but because most people build off the example, you could say that it does have defaults for example FieldTypes and defaults of what field names map to those. But it also only accepts certain example fields with the example Schema - you really have to go in and customize it to your needs - its setup to basically show off what options are available and work with some demo stuff. Solr comes with almost no defaults in a way - but it does ship with an example setup that is meant to show you how to set things up, and what is available. You could consider those defaults since most will build off it. example of Solr analyzer declaration: <!-- A general unstemmed text field - good if one does not know the language of the field --> <fieldType name= "textgen" class= "solr.TextField" positionIncrementGap= "100" > <analyzer type= "index" > <tokenizer class= "solr.WhitespaceTokenizerFactory" /> <filter class= "solr.StopFilterFactory" ignoreCase= " true " words= "stopwords.txt" enablePositionIncrements= " true " /> <filter class= "solr.WordDelimiterFilterFactory" generateWordParts= "1" generateNumberParts= "1" catenateWords= "1" catenateNumbers= "1" catenateAll= "0" splitOnCaseChange= "0" /> <filter class= "solr.LowerCaseFilterFactory" /> </analyzer> <analyzer type= "query" > <tokenizer class= "solr.WhitespaceTokenizerFactory" /> <filter class= "solr.SynonymFilterFactory" synonyms= "synonyms.txt" ignoreCase= " true " expand= " true " /> <filter class= "solr.StopFilterFactory" ignoreCase= " true " words= "stopwords.txt" enablePositionIncrements= " true " /> <filter class= "solr.WordDelimiterFilterFactory" generateWordParts= "1" generateNumberParts= "1" catenateWords= "0" catenateNumbers= "0" catenateAll= "0" splitOnCaseChange= "0" /> <filter class= "solr.LowerCaseFilterFactory" /> </analyzer> </fieldType>
          Hide
          Robert Muir added a comment -

          IW defaults to all other bunch of settings, so why is Analyzer different?

          Just my opinion, but what makes it different is that breaking text into tokens
          and indexing them is what search is all about.

          Deletions is an optional feature, and if someone got rid of it alltogether, I wouldn't
          even comment on the issue, I could care less.

          Show
          Robert Muir added a comment - IW defaults to all other bunch of settings, so why is Analyzer different? Just my opinion, but what makes it different is that breaking text into tokens and indexing them is what search is all about. Deletions is an optional feature, and if someone got rid of it alltogether, I wouldn't even comment on the issue, I could care less.
          Hide
          Michael McCandless added a comment -

          I'll revert the commit while we discuss...

          Show
          Michael McCandless added a comment - I'll revert the commit while we discuss...
          Hide
          Shai Erera added a comment -

          This doesn't help much Mark .

          Question - does SOLR requires everyone to specify an Analyzer, or does it come w/ a default one?

          Show
          Shai Erera added a comment - This doesn't help much Mark . Question - does SOLR requires everyone to specify an Analyzer, or does it come w/ a default one?
          Hide
          Mark Miller added a comment -

          If we say Analyzer is mandatory, what will stop us tomorrow from saying IndexDeletionPolicy is mandatory?

          Nothing But I think Analyzer should be mandatory and that IndexDeletionPolicy should not be mandatory, looking at them case by case.

          Show
          Mark Miller added a comment - If we say Analyzer is mandatory, what will stop us tomorrow from saying IndexDeletionPolicy is mandatory? Nothing But I think Analyzer should be mandatory and that IndexDeletionPolicy should not be mandatory, looking at them case by case.
          Hide
          Shai Erera added a comment -

          I disagree, but obviously I'm on the minority side. It is clearly documented what's the default Analyzer used is, and that you should change it if you want to get more meaningful analysis. When I created IWC I really wanted to simplify how IW is created. If we force IWC to accept both a Version AND an Analyzer, instantiating IW will look like this: new IndexWriter(dir, new IndexWriterConfig(matchVersion, analyzer));

          We don't accomplish anything with it, took away the MFL argument and replace it w/ IWC ... Remember - those that used to set all kind of parameters, using the other ctors, anyway care about how their IW is instantiated. The others that used IW(dir, analyzer, MFL) don't care about all other attributes. MFL is just annoyance, so we removed it. I just don't feel that a default Analyzer, which is Whitespace, is bad. It's easy to understand what your analysis looks like, and since it's well documented, nobody can say "hey, what didn't you warn me". IW defaults to all other bunch of settings, so why is Analyzer different?

          If we say Analyzer is mandatory, what will stop us tomorrow from saying IndexDeletionPolicy is mandatory? And then we'll get into whole bunch of ctors, only now on IWC? If we're documenting things clearly, and IWC documents clearly all its defaults, I see no reason why to require an Analyzer to be specified up front. At least to me, that will make this entire change useless. When I create my IW for serious indexing, I take care of all its settings. Otherwise I just instantiate it to check something completely not related to its defaults. If I test those, I define them (otherwise I cannot test them).

          Show
          Shai Erera added a comment - I disagree, but obviously I'm on the minority side. It is clearly documented what's the default Analyzer used is, and that you should change it if you want to get more meaningful analysis. When I created IWC I really wanted to simplify how IW is created. If we force IWC to accept both a Version AND an Analyzer, instantiating IW will look like this: new IndexWriter(dir, new IndexWriterConfig(matchVersion, analyzer)); We don't accomplish anything with it, took away the MFL argument and replace it w/ IWC ... Remember - those that used to set all kind of parameters, using the other ctors, anyway care about how their IW is instantiated. The others that used IW(dir, analyzer, MFL) don't care about all other attributes. MFL is just annoyance, so we removed it. I just don't feel that a default Analyzer, which is Whitespace, is bad. It's easy to understand what your analysis looks like, and since it's well documented, nobody can say "hey, what didn't you warn me". IW defaults to all other bunch of settings, so why is Analyzer different? If we say Analyzer is mandatory, what will stop us tomorrow from saying IndexDeletionPolicy is mandatory? And then we'll get into whole bunch of ctors, only now on IWC? If we're documenting things clearly, and IWC documents clearly all its defaults, I see no reason why to require an Analyzer to be specified up front. At least to me, that will make this entire change useless. When I create my IW for serious indexing, I take care of all its settings. Otherwise I just instantiate it to check something completely not related to its defaults. If I test those, I define them (otherwise I cannot test them).
          Hide
          Robert Muir added a comment -

          Here's my elaboration a bit:

          I like the concept of good, versioned defaults, but i think this is just a decision point that
          a user should have to make, like it is now.

          A practical concern that I have: Whitespace really sucks, we will get lots of questions
          on java-user because "lucene doesnt search upper/lowercase by default". This is because
          beginner users will only do what work is necessary for their application to compile.

          Show
          Robert Muir added a comment - Here's my elaboration a bit: I like the concept of good, versioned defaults, but i think this is just a decision point that a user should have to make, like it is now. A practical concern that I have: Whitespace really sucks, we will get lots of questions on java-user because "lucene doesnt search upper/lowercase by default". This is because beginner users will only do what work is necessary for their application to compile.
          Hide
          Michael McCandless added a comment -

          Robert raised a concern on java-dev: maybe we should not have a default analyzer....? Ie force the user to make an explicit choice.

          Show
          Michael McCandless added a comment - Robert raised a concern on java-dev: maybe we should not have a default analyzer....? Ie force the user to make an explicit choice.
          Hide
          Michael McCandless added a comment -

          Phew! Thanks Shai...

          Show
          Michael McCandless added a comment - Phew! Thanks Shai...
          Hide
          Michael McCandless added a comment -

          Patch looks good – thanks Shai!

          I'll commit in a day or two.

          Show
          Michael McCandless added a comment - Patch looks good – thanks Shai! I'll commit in a day or two.
          Hide
          Shai Erera added a comment -

          Patch with all tests and code converted to not use the deprecated API, and move to use IWC. All except CreateIndexTask in benchmark, which looks a bit more complicated to change, I left it out for now.

          Another thing - I deprecated *QueryParserWrapperTest and did not convert them to use the new IWC API, as those tests will be removed when *QueryParserWrapper will be removed.

          All tests pass.

          Show
          Shai Erera added a comment - Patch with all tests and code converted to not use the deprecated API, and move to use IWC. All except CreateIndexTask in benchmark, which looks a bit more complicated to change, I left it out for now. Another thing - I deprecated *QueryParserWrapperTest and did not convert them to use the new IWC API, as those tests will be removed when *QueryParserWrapper will be removed. All tests pass.
          Hide
          Shai Erera added a comment -

          While I'm converting the tests to use the new IWC, and getting rid of deprecated methods, I think it'll be useful to add a MergePolicyConfig. But since most of the setters used today assume a LogMergePolicy, and since it's also unrelated to IW, I prefer to leave it for a separate issue. Doing that, w/ the methods chaining approach, will simplify usage even more.

          Show
          Shai Erera added a comment - While I'm converting the tests to use the new IWC, and getting rid of deprecated methods, I think it'll be useful to add a MergePolicyConfig. But since most of the setters used today assume a LogMergePolicy, and since it's also unrelated to IW, I prefer to leave it for a separate issue. Doing that, w/ the methods chaining approach, will simplify usage even more.
          Hide
          Shai Erera added a comment -

          UNLIMITED_FIELD_LENGTH

          I deliberately named it like that. There is currently IW.DEFAULT_MAX_FIELD_LENGTH which is set to 10,000. I didn't want to keep the same name so as to not confuse. The default, UNLIMITED, is documented on setMaxFieldLength.

          But like you said, it may go entirely away in LUCENE-2295.

          Annoying we moved IW as required into MP

          I think I'm to blame for this .

          How about making IW also do multiple lines for its non-IWC setting

          Done.

          Thanks for the comments. Continuing with converting the code. I'm flying to the US tonight, so I may not finish it before I leave. And then I'll be gone for ~30 hours before I can ping again .

          Show
          Shai Erera added a comment - UNLIMITED_FIELD_LENGTH I deliberately named it like that. There is currently IW.DEFAULT_MAX_FIELD_LENGTH which is set to 10,000. I didn't want to keep the same name so as to not confuse. The default, UNLIMITED, is documented on setMaxFieldLength. But like you said, it may go entirely away in LUCENE-2295 . Annoying we moved IW as required into MP I think I'm to blame for this . How about making IW also do multiple lines for its non-IWC setting Done. Thanks for the comments. Continuing with converting the code. I'm flying to the US tonight, so I may not finish it before I leave. And then I'll be gone for ~30 hours before I can ping again .
          Hide
          Michael McCandless added a comment -

          OK let's keep the semantics of "if you pass null that means restore to default".

          Should we rename UNLIMITED_FIELD_LENGTH -> DEFAULT_MAX_FIELD_LENGTH? (Though w/ the new issue this should be going away from IWC anyway).

          IWC.toString() includes '\n' between settings

          Ahh, I misread it (I thought IW was doing \n followed by single space). I agree multiple lines is much better. How about making IW also do multiple lines for its non-IWC settings that are messaged? And then removing the config= from the output? Ie all I see is a bunch of settings. I don't think [in this debug printout] that we need to message which setting was on IW and which was from the IWC.

          I've commented on it in this issue - MP requires an IW instance to be passed to its ctor.(see my comment from 04/Mar/10 09:28 PM).

          Ahh, sorry, right. Annoying we moved IW as required into MP Hrmph.

          I think I'll just clone the incoming IWC on IW and leave the rest as-is?

          OK, I think this is fine, for starters. The API semantics are clearly defined, so, if at a later date we harden the enforcement (say throw IllegalStateEx if you try to change anything after IWC has been consumed by IW, that's fair game).

          Maybe clearly state in the jdocs that presently no settings are version dependent, so it's just a placeholder? (Basically, elevate the code comment that already says this, into the jdocs). Whenever a class in Lucene takes a Version, we should strongly document how Version alters settings, even if it's a no-op placeholder. (And I agree we should have it here as placeholder).

          Patch looks good Shai, thanks!

          Show
          Michael McCandless added a comment - OK let's keep the semantics of "if you pass null that means restore to default". Should we rename UNLIMITED_FIELD_LENGTH -> DEFAULT_MAX_FIELD_LENGTH? (Though w/ the new issue this should be going away from IWC anyway). IWC.toString() includes '\n' between settings Ahh, I misread it (I thought IW was doing \n followed by single space). I agree multiple lines is much better. How about making IW also do multiple lines for its non-IWC settings that are messaged? And then removing the config= from the output? Ie all I see is a bunch of settings. I don't think [in this debug printout] that we need to message which setting was on IW and which was from the IWC. I've commented on it in this issue - MP requires an IW instance to be passed to its ctor.(see my comment from 04/Mar/10 09:28 PM). Ahh, sorry, right. Annoying we moved IW as required into MP Hrmph. I think I'll just clone the incoming IWC on IW and leave the rest as-is? OK, I think this is fine, for starters. The API semantics are clearly defined, so, if at a later date we harden the enforcement (say throw IllegalStateEx if you try to change anything after IWC has been consumed by IW, that's fair game). Maybe clearly state in the jdocs that presently no settings are version dependent, so it's just a placeholder? (Basically, elevate the code comment that already says this, into the jdocs). Whenever a class in Lucene takes a Version, we should strongly document how Version alters settings, even if it's a no-op placeholder. (And I agree we should have it here as placeholder). Patch looks good Shai, thanks!
          Hide
          Shai Erera added a comment -

          Patch with applied comments. I'll start moving the code to use the new ctor, class etc.

          Show
          Shai Erera added a comment - Patch with applied comments. I'll start moving the code to use the new ctor, class etc.
          Hide
          Shai Erera added a comment -

          Hmm... I think we should still allow package private specification of the indexing chain?

          Ok, I'll add it back.

          For IWC.setAnalyzer(null)

          I preferred not to (it's not just setAnalyzer, but also MS, Similarity and IndexDeletionPolicy). I specifically documented on each what happens if one passes null. I think it's better service - you're not expected to pass null, and so if you do, instead of throwing an exception, we revert to default. I thought at some point to add a restoreDefaults() method, but then realized that doing new IWC() is not that expensive ...
          I think - if someone uses IWC but receives any of those settings from the outside, instead of asking him to always check for null, we tell him "pass null, we revert to default" ...

          In IWC we call it "scheduler"

          Woops, fixed that !

          Why can't MergePolicy also live in IWC?

          I've commented on it in this issue - MP requires an IW instance to be passed to its ctor.(see my comment from 04/Mar/10 09:28 PM).

          IW.messageState will have a space before each of its entries

          IWC.toString() includes '\n' between settings. I thought it'd be more readable that way because otherwise it'd be a long line. The output for me looks like that:

          IW 0 [main]: dir=org.apache.lucene.store.RAMDirectory@2ca22ca2 mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@6ca06ca index= version=3.1-dev config=matchVersion=LUCENE_31
          analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
          delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
          commit=null
          openMode=CREATE_OR_APPEND
          maxFieldLength=2147483647
          similarity=org.apache.lucene.search.DefaultSimilarity
          termIndexInterval=128
          mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler
          default WRITE_LOCK_TIMEOUT=1000
          writeLockTimeout=1000
          maxBufferedDeleteTerms=-1
          ramBufferSizeMB=16.0
          maxBufferedDocs=-1
          

          Perhaps I'll print "config=\n" so that all config parameters start on the new line, and not all but the first. Is that acceptable, or you still prefer all to be on one line, space separated?

          Need small jdoc for the new IW ctor

          I'll add that. That one slipped .

          Should we disallow all setters on IWC after it's been consumed by an IW?

          I've thought about all the options that you raise, and decided to keep the situation as-is, to let others also comment on that. So I'm glad you commented .

          • I've documented on IW.getConfig() that setting anything on the returned object has no effect on IW instance, and if one needs to do it, one should re-instantiate IW.
          • I also thought to turn off setters (setting IWC to read-only by IW), but then someone won't be able to reuse an IWC
          • Removing the fields from IW is not good, because then someone could really call getConfig().setMaxBufferedDocs and that will affect IW, unlike what the comment says. If we want to really keep everything in IWC, we need to clone on ctor and getConfig() time. Seems a waste to me.

          I think I'll just clone the incoming IWC on IW and leave the rest as-is?

          Show
          Shai Erera added a comment - Hmm... I think we should still allow package private specification of the indexing chain? Ok, I'll add it back. For IWC.setAnalyzer(null) I preferred not to (it's not just setAnalyzer, but also MS, Similarity and IndexDeletionPolicy). I specifically documented on each what happens if one passes null. I think it's better service - you're not expected to pass null, and so if you do, instead of throwing an exception, we revert to default. I thought at some point to add a restoreDefaults() method, but then realized that doing new IWC() is not that expensive ... I think - if someone uses IWC but receives any of those settings from the outside, instead of asking him to always check for null, we tell him "pass null, we revert to default" ... In IWC we call it "scheduler" Woops, fixed that ! Why can't MergePolicy also live in IWC? I've commented on it in this issue - MP requires an IW instance to be passed to its ctor.(see my comment from 04/Mar/10 09:28 PM). IW.messageState will have a space before each of its entries IWC.toString() includes '\n' between settings. I thought it'd be more readable that way because otherwise it'd be a long line. The output for me looks like that: IW 0 [main]: dir=org.apache.lucene.store.RAMDirectory@2ca22ca2 mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@6ca06ca index= version=3.1-dev config=matchVersion=LUCENE_31 analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy commit= null openMode=CREATE_OR_APPEND maxFieldLength=2147483647 similarity=org.apache.lucene.search.DefaultSimilarity termIndexInterval=128 mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler default WRITE_LOCK_TIMEOUT=1000 writeLockTimeout=1000 maxBufferedDeleteTerms=-1 ramBufferSizeMB=16.0 maxBufferedDocs=-1 Perhaps I'll print "config=\n" so that all config parameters start on the new line, and not all but the first. Is that acceptable, or you still prefer all to be on one line, space separated? Need small jdoc for the new IW ctor I'll add that. That one slipped . Should we disallow all setters on IWC after it's been consumed by an IW? I've thought about all the options that you raise, and decided to keep the situation as-is, to let others also comment on that. So I'm glad you commented . I've documented on IW.getConfig() that setting anything on the returned object has no effect on IW instance, and if one needs to do it, one should re-instantiate IW. I also thought to turn off setters (setting IWC to read-only by IW), but then someone won't be able to reuse an IWC Removing the fields from IW is not good, because then someone could really call getConfig().setMaxBufferedDocs and that will affect IW, unlike what the comment says. If we want to really keep everything in IWC, we need to clone on ctor and getConfig() time. Seems a waste to me. I think I'll just clone the incoming IWC on IW and leave the rest as-is?
          Hide
          Michael McCandless added a comment -

          Ha, I see TODO 4.0 comments now. Sheesh

          Patch looks great Shai!

          A few comments:

          • Hmm... I think we should still allow package private specification
            of the indexing chain? Not long ago Michael B. explicitly opened
            this up (while keeping it package private). I think for
            way-expert users we should keep it extensible?
          • For IWC.setAnalyzer(null), instead of silently swapping in
            WhitespaceAnalyzer, can we throw IllegalArgE?
          • Should we disallow all setters on IWC after it's been consumed by
            an IW? (Though one may want to re-use to make another IW).
            Or... make it very clear that IW fully consumes the IWC on
            creation and never again looks at it...? (OK, I do see this in
            IW.getConfig and IWC). Should IW just clone the IWC and reference
            it internally, instead of copying to private fields? Or... IW
            marks the IWC readOnly (and setters on it then fail), and you must
            clone it if you will re-use?
          • In IWC we call it "scheduler" but seems like we should keep the
            name "mergeScheduler"? (And in toString, too).
          • Why can't MergePolicy also live in IWC? Seems odd to move
            MergeScheduler to it but not MP. (And deprecate set/getMP in IW)
          • Nit-picky, but, IW.messageState will have a space before each of
            its entries but then IWC's entries have no space – looks not so
            nice on enabling infoStream?
          • Need small jdoc for the new IW ctor
          Show
          Michael McCandless added a comment - Ha, I see TODO 4.0 comments now. Sheesh Patch looks great Shai! A few comments: Hmm... I think we should still allow package private specification of the indexing chain? Not long ago Michael B. explicitly opened this up (while keeping it package private). I think for way-expert users we should keep it extensible? For IWC.setAnalyzer(null), instead of silently swapping in WhitespaceAnalyzer, can we throw IllegalArgE? Should we disallow all setters on IWC after it's been consumed by an IW? (Though one may want to re-use to make another IW). Or... make it very clear that IW fully consumes the IWC on creation and never again looks at it...? (OK, I do see this in IW.getConfig and IWC). Should IW just clone the IWC and reference it internally, instead of copying to private fields? Or... IW marks the IWC readOnly (and setters on it then fail), and you must clone it if you will re-use? In IWC we call it "scheduler" but seems like we should keep the name "mergeScheduler"? (And in toString, too). Why can't MergePolicy also live in IWC? Seems odd to move MergeScheduler to it but not MP. (And deprecate set/getMP in IW) Nit-picky, but, IW.messageState will have a space before each of its entries but then IWC's entries have no space – looks not so nice on enabling infoStream? Need small jdoc for the new IW ctor
          Hide
          Shai Erera added a comment -

          Patch includes:

          • IndexWriterConfig + Test
          • Changes to IndexWriter to use IWC
          • Changes to DocumentsWriter - removed IndexingChain from its ctor and always use default.

          IndexingChain is package-private and so was the ctor on IndexWriter which allowed to set it. If that's important (e.g. used by SOLR maybe?), then I can add a package-private setting to IndexWriterConfig which will allow setting this. But since I haven't found any calls in Lucene (tests nor java), and no implementations of this class besides the default that is in DW, I thought this can be removed.

          I would like to have this patched reviewed, before I go about changing all the code to use the new IWC (tests + java). If we agree on the details and how it looks and used, I'll do it - should be straightforward.

          Show
          Shai Erera added a comment - Patch includes: IndexWriterConfig + Test Changes to IndexWriter to use IWC Changes to DocumentsWriter - removed IndexingChain from its ctor and always use default. IndexingChain is package-private and so was the ctor on IndexWriter which allowed to set it. If that's important (e.g. used by SOLR maybe?), then I can add a package-private setting to IndexWriterConfig which will allow setting this. But since I haven't found any calls in Lucene (tests nor java), and no implementations of this class besides the default that is in DW, I thought this can be removed. I would like to have this patched reviewed, before I go about changing all the code to use the new IWC (tests + java). If we agree on the details and how it looks and used, I'll do it - should be straightforward.
          Hide
          Shai Erera added a comment -

          Exactly !

          Only I hope that if people will interact w/ their MP to set stuff on it, they would interact with it to query its settings. Calling IW.getMergePolicy will become redundant, or at least will be avoidable. Maybe only when you rely on the default MP, you'll perform the last code example you gave above. Actually, I've already done 98% of the change. I need to finish some stuff on IW and then I'll post the patch.

          On the MFL/Analyzer - I like the approach of wrapping an Analyzer with a MFLAnalyzer. The beauty is that if I don't want to limit anything, I don't need to do anything and the code won't need to keep track of how many tokens were indexed for that field ... I still think it should be done in a separate issue, as it involves introducing some code in the lower levels to wrap any Analyzer with it, for back-compat. If we do it, let's do it before 3.1 is out, so we can remove it from IWC right away w/o deprecating anything ...

          I'll open an issue for it.

          Show
          Shai Erera added a comment - Exactly ! Only I hope that if people will interact w/ their MP to set stuff on it, they would interact with it to query its settings. Calling IW.getMergePolicy will become redundant, or at least will be avoidable. Maybe only when you rely on the default MP, you'll perform the last code example you gave above. Actually, I've already done 98% of the change. I need to finish some stuff on IW and then I'll post the patch. On the MFL/Analyzer - I like the approach of wrapping an Analyzer with a MFLAnalyzer. The beauty is that if I don't want to limit anything, I don't need to do anything and the code won't need to keep track of how many tokens were indexed for that field ... I still think it should be done in a separate issue, as it involves introducing some code in the lower levels to wrap any Analyzer with it, for back-compat. If we do it, let's do it before 3.1 is out, so we can remove it from IWC right away w/o deprecating anything ... I'll open an issue for it.
          Hide
          Michael McCandless added a comment -

          So this will affect these methods in IW, right:

          get/setMergeFactor(...)
          get/setUseCompoundFile(...)
          get/setMaxMergeDocs(...)
          

          Ie, we'd deprecate them. Instead you'd have to interact /w the MergePolicy. IW will still default its merge policy, so with this change, instead of:

          writer.setMergeFactor(7);
          

          You'd have to do:

          ((LogMergePolicy) writer.getMergePolicy()).setMergeFactor(7));
          

          (And take the risk of runtime cast yourself/explicitly, which IW was already doing under the hood anyway).

          Show
          Michael McCandless added a comment - So this will affect these methods in IW, right: get/setMergeFactor(...) get/setUseCompoundFile(...) get/setMaxMergeDocs(...) Ie, we'd deprecate them. Instead you'd have to interact /w the MergePolicy. IW will still default its merge policy, so with this change, instead of: writer.setMergeFactor(7); You'd have to do: ((LogMergePolicy) writer.getMergePolicy()).setMergeFactor(7)); (And take the risk of runtime cast yourself/explicitly, which IW was already doing under the hood anyway).
          Hide
          Shai Erera added a comment -

          Ok, then I'll proceed w/ #3.

          Show
          Shai Erera added a comment - Ok, then I'll proceed w/ #3.
          Hide
          Yonik Seeley added a comment -

          Yay, we'll be able to remove SolrIndexConfig and use this

          Show
          Yonik Seeley added a comment - Yay, we'll be able to remove SolrIndexConfig and use this
          Hide
          Earwin Burrfoot added a comment -

          I voted for killing these delegating methods some time ago. It ended in nothing, so I vote again, #3

          Show
          Earwin Burrfoot added a comment - I voted for killing these delegating methods some time ago. It ended in nothing, so I vote again, #3
          Hide
          Shai Erera added a comment -

          There are some settings on IW that go directly to the MergePolicy used (well ... only if it's LogMergePolicy). At first I wanted to move them, together w/ MP, to IWC, but this isn't possible, because MP requires IW to be passed to its ctor, and when IWC is created, there is no IW ... So there are couple of options I can think of:

          • Add all to IWC, and expose on MP a final setIndexWriter, package-private, as well as an empty ctor. IW will later set MP's IW. It will also allow applications to pass their own MP after IW has been created (like they can do today). IWC will delegate all set/get to MP, like IW does today. The only downside is that the IW member of MP won't be final anymore ...
          • Keep the get/set on IW, like it is today. It will split the configuration of IW to two places ... I prefer to keep it all in one place.
          • Remove all the set/get from IW and let applications interface directly with MP. It will remove the bizarre situation where someone could pass a non LogMergePolicy to IW, however when he'll try to set anything on it from IW, it will fail ... If we remove the get/set from IW and let folks interface w/ their MP directly, we won't have this problem:
            • They'll need to declare the MP of type LogMP, or otherwise the API won't be there.
            • if they use their own MP, with special API, they'll interface w/ MP's configuration from one place ...

          Between the three, I like (3) the best as it will get rid of the inconsistency in IW today (expecting only LogMP, but requiring MP). But if those set/get are important to keep together with the other configuration, I prefer (1). What do you think?

          Show
          Shai Erera added a comment - There are some settings on IW that go directly to the MergePolicy used (well ... only if it's LogMergePolicy). At first I wanted to move them, together w/ MP, to IWC, but this isn't possible, because MP requires IW to be passed to its ctor, and when IWC is created, there is no IW ... So there are couple of options I can think of: Add all to IWC, and expose on MP a final setIndexWriter, package-private, as well as an empty ctor. IW will later set MP's IW. It will also allow applications to pass their own MP after IW has been created (like they can do today). IWC will delegate all set/get to MP, like IW does today. The only downside is that the IW member of MP won't be final anymore ... Keep the get/set on IW, like it is today. It will split the configuration of IW to two places ... I prefer to keep it all in one place. Remove all the set/get from IW and let applications interface directly with MP. It will remove the bizarre situation where someone could pass a non LogMergePolicy to IW, however when he'll try to set anything on it from IW, it will fail ... If we remove the get/set from IW and let folks interface w/ their MP directly, we won't have this problem: They'll need to declare the MP of type LogMP, or otherwise the API won't be there. if they use their own MP, with special API, they'll interface w/ MP's configuration from one place ... Between the three, I like (3) the best as it will get rid of the inconsistency in IW today (expecting only LogMP, but requiring MP). But if those set/get are important to keep together with the other configuration, I prefer (1). What do you think?
          Hide
          Michael McCandless added a comment -

          Today there is no DEFAULT .. IW forces you to pass MFL so whoever moves to the new API can define whatever he wants. We'll default to UNLIMITED but there won't be any back-compat issue ..

          Ahh sorry right. In the "olden days", 10000 was the default.

          we could make a TokenFilter to do this?

          I'm afraid that will result in changing all Analyzers to work properly? Or you mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this filter? That could work, but as soon as that becomes a filter, people may use it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em down?).

          Hmm yeah quite a hassle to fix all analyzers. Hmmm.

          I guess I'd like to keep it as it is now, not turning the issue into a bigger thing ... and a filter alone won't solve it - we'd still need to provide a way to configure it, or otherwise everyone will need to wrap their Analyzers with such filter?

          Maybe one solution is to wrap any other analyzer? Ie, create a StopAfterNTokensAnalyzer, taking another analyzer that it delegates to, and then sticking on this StopAfterNTokensFilter to each token stream.

          But yeah maybe break this out as a separate issue...

          Also, if I'd use such a filter myself, I wouldn't put it last in the chain, so that I can avoid doing any processing on a term that is not going to end up in the index. Although that's not too critical because I'll be doing this for just one term ...

          Actually it ought to be 0 terms wasted, with the filter @ the end – with this StopAfterNTokensFilter, it'll immediately return false w/o asking for the 10001th token.

          One thing I should add to IWC so far (I hope to post a patch even today) is a Version parameter. For now it will be ignored, but as a placeholder to change settings in the future.

          +1

          Show
          Michael McCandless added a comment - Today there is no DEFAULT .. IW forces you to pass MFL so whoever moves to the new API can define whatever he wants. We'll default to UNLIMITED but there won't be any back-compat issue .. Ahh sorry right. In the "olden days", 10000 was the default. we could make a TokenFilter to do this? I'm afraid that will result in changing all Analyzers to work properly? Or you mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this filter? That could work, but as soon as that becomes a filter, people may use it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em down?). Hmm yeah quite a hassle to fix all analyzers. Hmmm. I guess I'd like to keep it as it is now, not turning the issue into a bigger thing ... and a filter alone won't solve it - we'd still need to provide a way to configure it, or otherwise everyone will need to wrap their Analyzers with such filter? Maybe one solution is to wrap any other analyzer? Ie, create a StopAfterNTokensAnalyzer, taking another analyzer that it delegates to, and then sticking on this StopAfterNTokensFilter to each token stream. But yeah maybe break this out as a separate issue... Also, if I'd use such a filter myself, I wouldn't put it last in the chain, so that I can avoid doing any processing on a term that is not going to end up in the index. Although that's not too critical because I'll be doing this for just one term ... Actually it ought to be 0 terms wasted, with the filter @ the end – with this StopAfterNTokensFilter, it'll immediately return false w/o asking for the 10001th token. One thing I should add to IWC so far (I hope to post a patch even today) is a Version parameter. For now it will be ignored, but as a placeholder to change settings in the future. +1
          Hide
          Shai Erera added a comment -

          I wouldn't worry about what's required - Directory will be left out, MFL is useless and a pain anyway, so what's left is Analyzer. I can put Analyzer on IWC's ctor, but I personally think we can default to a simple one (such as Whitespace) encouraging the people to set their own. I find it very annoying today when I want to test something about IW and I need to pass all these things to IW ...

          The way I see it, those who want to rely on Lucene's latest and greatest can just do: IndexWriter writer = new IndexWriter(dir, new IWC()); Well maybe except for the Analyzer, but I really don't think it matters that much. And like you wrote, someone can chain the setters. So win-win? If you don't care about anything, just wants to open a writer, index something and that's it, you don't need to specify anything .. otherwise you just chain calls?

          One thing I should add to IWC so far (I hope to post a patch even today) is a Version parameter. For now it will be ignored, but as a placeholder to change settings in the future.

          Show
          Shai Erera added a comment - I wouldn't worry about what's required - Directory will be left out, MFL is useless and a pain anyway, so what's left is Analyzer. I can put Analyzer on IWC's ctor, but I personally think we can default to a simple one (such as Whitespace) encouraging the people to set their own. I find it very annoying today when I want to test something about IW and I need to pass all these things to IW ... The way I see it, those who want to rely on Lucene's latest and greatest can just do: IndexWriter writer = new IndexWriter(dir, new IWC()); Well maybe except for the Analyzer, but I really don't think it matters that much. And like you wrote, someone can chain the setters. So win-win? If you don't care about anything, just wants to open a writer, index something and that's it, you don't need to specify anything .. otherwise you just chain calls? One thing I should add to IWC so far (I hope to post a patch even today) is a Version parameter. For now it will be ignored, but as a placeholder to change settings in the future.
          Hide
          Mark Miller added a comment - - edited

          I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them.

          Creating config classes to init another class is its own pain in the butt. Reminds me of windows C programming and structs. When I'm just coding away, its so much easier to just enter the params in the cnstr. And it seems like it would be more difficult to know whats required to set on the config class - without the same cstr business ...

          edit

          Though I suppose the chaining does makes this more swallowable...

          new IW(new IWConfig(Analyzer).set().set().set()) isn't really so bad ...

          Show
          Mark Miller added a comment - - edited I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of windows C programming and structs. When I'm just coding away, its so much easier to just enter the params in the cnstr. And it seems like it would be more difficult to know whats required to set on the config class - without the same cstr business ... edit Though I suppose the chaining does makes this more swallowable... new IW(new IWConfig(Analyzer).set().set().set()) isn't really so bad ...
          Hide
          Shai Erera added a comment -

          But, if we change the default to UNLIMITED

          Today there is no DEFAULT .. IW forces you to pass MFL so whoever moves to the new API can define whatever he wants. We'll default to UNLIMITED but there won't be any back-compat issue ...

          we could make a TokenFilter to do this?

          I'm afraid that will result in changing all Analyzers to work properly? Or you mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this filter? That could work, but as soon as that becomes a filter, people may use it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em down?). Also, if I'd use such a filter myself, I wouldn't put it last in the chain, so that I can avoid doing any processing on a term that is not going to end up in the index. Although that's not too critical because I'll be doing this for just one term ...

          I guess I'd like to keep it as it is now, not turning the issue into a bigger thing ... and a filter alone won't solve it - we'd still need to provide a way to configure it, or otherwise everyone will need to wrap their Analyzers with such filter?

          Show
          Shai Erera added a comment - But, if we change the default to UNLIMITED Today there is no DEFAULT .. IW forces you to pass MFL so whoever moves to the new API can define whatever he wants. We'll default to UNLIMITED but there won't be any back-compat issue ... we could make a TokenFilter to do this? I'm afraid that will result in changing all Analyzers to work properly? Or you mean DW (or somewhere) will wrap whatever TS an Analyzer returns w/ this filter? That could work, but as soon as that becomes a filter, people may use it, and wrapping their TS w/ that filter will be unnecessary (and slow 'em down?). Also, if I'd use such a filter myself, I wouldn't put it last in the chain, so that I can avoid doing any processing on a term that is not going to end up in the index. Although that's not too critical because I'll be doing this for just one term ... I guess I'd like to keep it as it is now, not turning the issue into a bigger thing ... and a filter alone won't solve it - we'd still need to provide a way to configure it, or otherwise everyone will need to wrap their Analyzers with such filter?
          Hide
          Michael McCandless added a comment -

          +1 – this is great!

          I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength).

          This is because it's a dangerous setting (you silently lose content
          while indexing), a trap. So we want to force the user to make the
          choice, up front, so they realize the implications.

          But, if we change the default to UNLIMITED (which we should do under
          Version), then I agree you should not have to specify it.

          In my opinion it should be left to the apploication to limit the number of tokens if needed, but not silently drop tokens

          I like that approach – we could make a TokenFilter to do this? Then
          we don't need MFL at all in IWC (and deprecate in IW).

          I was wondering if perhaps instead of allowing to pass a create=true/false, we should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND

          +1

          I'm thinking to make this whole IWC a constructor only parameter to IW, without the ability to set it afterwards.

          +1 in general, though we should go setting by setting to confirm this is OK. I
          don't know of "real" use cases where apps eg want to change RAM buffer
          or mergeFactor... but maybe there are some interesting usages out
          there.

          Show
          Michael McCandless added a comment - +1 – this is great! I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). This is because it's a dangerous setting (you silently lose content while indexing), a trap. So we want to force the user to make the choice, up front, so they realize the implications. But, if we change the default to UNLIMITED (which we should do under Version), then I agree you should not have to specify it. In my opinion it should be left to the apploication to limit the number of tokens if needed, but not silently drop tokens I like that approach – we could make a TokenFilter to do this? Then we don't need MFL at all in IWC (and deprecate in IW). I was wondering if perhaps instead of allowing to pass a create=true/false, we should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND +1 I'm thinking to make this whole IWC a constructor only parameter to IW, without the ability to set it afterwards. +1 in general, though we should go setting by setting to confirm this is OK. I don't know of "real" use cases where apps eg want to change RAM buffer or mergeFactor... but maybe there are some interesting usages out there.
          Hide
          Shai Erera added a comment -

          I'm thinking to make this whole IWC a constructor only parameter to IW, without the ability to set it afterwards. I don't see any reason why would anyone change the RAM limit, Similarity etc while IW is running. What's the advantage vs. say close the current IW and open a new one with the different settings? I know the latter is more expensive, and I write it deliberately - I think those settings are really ctor-only settings. Otherwise you might get inconsistent documents (like changing the Similarity or max field length).

          This will also simplify IWC, because now I need to distinguish between settings that cannot be altered afterwards, like changing IndexDeletionPolicy, create, IndexCommit, Analyzer ... if IWC will be a ctor only object, I can have only the default ctor (to init to default settings) and provide the setters otherwise.

          Any objections?

          Show
          Shai Erera added a comment - I'm thinking to make this whole IWC a constructor only parameter to IW, without the ability to set it afterwards. I don't see any reason why would anyone change the RAM limit, Similarity etc while IW is running. What's the advantage vs. say close the current IW and open a new one with the different settings? I know the latter is more expensive, and I write it deliberately - I think those settings are really ctor-only settings. Otherwise you might get inconsistent documents (like changing the Similarity or max field length). This will also simplify IWC, because now I need to distinguish between settings that cannot be altered afterwards, like changing IndexDeletionPolicy, create, IndexCommit, Analyzer ... if IWC will be a ctor only object, I can have only the default ctor (to init to default settings) and provide the setters otherwise. Any objections?
          Hide
          Shai Erera added a comment -

          IndexingChain is one of the things that can be set on IW, however I don't see any implementations of it besides the default, and the class itself is package-private, so no app could actually set it on IW (unless it puts its code under o.a.l.index). Therefore I'm thinking of not introducing it on IWC, or turn it to a public class?
          Is it really something we expect any application out there to set, or can we simply make DocsWriter impl one for itself internally, and don't declare this class as abstract etc.?

          Show
          Shai Erera added a comment - IndexingChain is one of the things that can be set on IW, however I don't see any implementations of it besides the default, and the class itself is package-private, so no app could actually set it on IW (unless it puts its code under o.a.l.index). Therefore I'm thinking of not introducing it on IWC, or turn it to a public class? Is it really something we expect any application out there to set, or can we simply make DocsWriter impl one for itself internally, and don't declare this class as abstract etc.?
          Hide
          Shai Erera added a comment -

          I was wondering if perhaps instead of allowing to pass a create=true/false, we should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND. The current meaning of create is a bit unclear. I.e. if it is true, then overwrite. But if it is false, don't attempt to create, but just open an existing one. However if the directory is empty, it throws an exception. I think an enum would someone to pass CREATE_OR_APPEND in case he doesn't know if there is an index there ... but I don't want to complicate things unnecessarily ... what do you think?

          Show
          Shai Erera added a comment - I was wondering if perhaps instead of allowing to pass a create=true/false, we should use an enum with 3 values: CREATE, APPEND, CREATE_OR_APPEND. The current meaning of create is a bit unclear. I.e. if it is true, then overwrite. But if it is false, don't attempt to create, but just open an existing one. However if the directory is empty, it throws an exception. I think an enum would someone to pass CREATE_OR_APPEND in case he doesn't know if there is an index there ... but I don't want to complicate things unnecessarily ... what do you think?
          Hide
          Shai Erera added a comment -

          Yeah I don't like it either (makes my code unnecessarily long). And I always use UNLIMITED, and the LIMITED=10,000 is really just a guess, and so if anyone wants to limit it, he needs to do new MaxFieldLength(otherLimit) which is unnecessarily long as well ...

          I like it - I'll deprecate on IW and introduce UNLIMITED on IWC.

          Show
          Shai Erera added a comment - Yeah I don't like it either (makes my code unnecessarily long). And I always use UNLIMITED, and the LIMITED=10,000 is really just a guess, and so if anyone wants to limit it, he needs to do new MaxFieldLength(otherLimit) which is unnecessarily long as well ... I like it - I'll deprecate on IW and introduce UNLIMITED on IWC.
          Hide
          Uwe Schindler added a comment -

          In my opinion the whole class is unneeded, so only deprecate in IW but not add it to IWConfig. For me a constant in IWConfig would be enough that defines UNLIMITED and everything else is just an integer. +1 for defaulting to "static final UNLIMITED=Integer.MAX_VALUE". I am not sure why this limitation is there at all. In my opinion it should be left to the apploication to limit the number of tokens if needed, but not silently drop tokens. If somebody gets an OOM, he can adjust the value and knows that mayabe some tokens get lost.

          Show
          Uwe Schindler added a comment - In my opinion the whole class is unneeded, so only deprecate in IW but not add it to IWConfig. For me a constant in IWConfig would be enough that defines UNLIMITED and everything else is just an integer. +1 for defaulting to "static final UNLIMITED=Integer.MAX_VALUE". I am not sure why this limitation is there at all. In my opinion it should be left to the apploication to limit the number of tokens if needed, but not silently drop tokens. If somebody gets an OOM, he can adjust the value and knows that mayabe some tokens get lost.
          Hide
          Shai Erera added a comment -

          Thanks Uwe for the pointer.

          I suppose that MaxFieldLength should now move to IndexWriterConfig, only it is public and therefore needs to be deprecated. Otherwise it will look strange that in order to set MFL on IWC you need to reference IW. So deprecate and duplicate? To IW it doesn't matter because it just takes the limit (int) from MFL ...

          Show
          Shai Erera added a comment - Thanks Uwe for the pointer. I suppose that MaxFieldLength should now move to IndexWriterConfig, only it is public and therefore needs to be deprecated. Otherwise it will look strange that in order to set MFL on IWC you need to reference IW. So deprecate and duplicate? To IW it doesn't matter because it just takes the limit (int) from MFL ...
          Hide
          Uwe Schindler added a comment - - edited

          +1 for the IndexWriterConfig with chaining method calls

          We had a discussion about this a while ago on the mailinglist: http://www.lucidimagination.com/search/document/19e1a19f4d340b8c/lucene_2_9_and_deprecated_ir_open_methods

          Show
          Uwe Schindler added a comment - - edited +1 for the IndexWriterConfig with chaining method calls We had a discussion about this a while ago on the mailinglist: http://www.lucidimagination.com/search/document/19e1a19f4d340b8c/lucene_2_9_and_deprecated_ir_open_methods

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Shai Erera
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development