Lucene - Core
  1. Lucene - Core
  2. LUCENE-4877

Fix analyzer factories to throw exception when arguments are invalid

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, Trunk
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently if someone typos an argument "someParamater=xyz" instead of someParameter=xyz, they get no exception and sometimes incorrect behavior.

      It would be way better if these factories threw exception on unknown params, e.g. they removed the args they used and checked they were empty at the end.

      1. LUCENE-4877_one_solution_prototype.patch
        13 kB
        Robert Muir
      2. LUCENE-4877.patch
        246 kB
        Robert Muir
      3. LUCENE-4877.patch
        493 kB
        Robert Muir
      4. LUCENE-4877.patch
        569 kB
        Robert Muir
      5. LUCENE-4877-steve-minor-fixes.patch
        3 kB
        Steve Rowe
      6. LUCENE-4877-steve-more-param-parsing-verbs.patch
        68 kB
        Steve Rowe

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.
          Hide
          Robert Muir added a comment -

          Thanks for cleaning this up Steve, much nicer.

          I'd like to merge these commits back to 4.x if there are no objections.

          Show
          Robert Muir added a comment - Thanks for cleaning this up Steve, much nicer. I'd like to merge these commits back to 4.x if there are no objections.
          Hide
          Steve Rowe added a comment -

          This patch adds more param parsing methods to AbstractAnalysisFactory, including get(), require(), getFloat(), getChar(), and getSet(), and changed all analysis factories to use them where appropriate.

          I don't like the 4-arg required param getXXX() methods in AbstractAnalysisFactory - 4th param as false means required??? - maybe these could be converted to getRequiredXXX() ?

          I implemented these as require(), requireXXX(), etc.

          Tests all pass, and precommit's happy.

          Committing shortly.

          Show
          Steve Rowe added a comment - This patch adds more param parsing methods to AbstractAnalysisFactory, including get(), require(), getFloat(), getChar(), and getSet(), and changed all analysis factories to use them where appropriate. I don't like the 4-arg required param getXXX() methods in AbstractAnalysisFactory - 4th param as false means required??? - maybe these could be converted to getRequiredXXX() ? I implemented these as require(), requireXXX(), etc. Tests all pass, and precommit's happy. Committing shortly.
          Hide
          Robert Muir added a comment -

          Thanks Steve!

          Show
          Robert Muir added a comment - Thanks Steve!
          Hide
          Steve Rowe added a comment -

          Patch fixing these minor nits:

          TestMappingCharFilterFactory's factory could switch to being instantiated using the charFilterFactory() method
          EdgeNgramTokenizerFactory's gram size constants are pulled from EdgeNgramTokenFilter instead of EdgeNgramTokenizer
          LimitTokenCountFilterFactory's maxTokenCount param should be required; this is a pre-existing problem though
          PatternTokenizerFactory's group param should use the getInt() method with a default of -1.

          Actually, I was wrong about LimitTokenCountFilterFactory - it already has a test in place to insure reporting of missing required maxTokenCount param.

          Committing shortly.

          Show
          Steve Rowe added a comment - Patch fixing these minor nits: TestMappingCharFilterFactory's factory could switch to being instantiated using the charFilterFactory() method EdgeNgramTokenizerFactory's gram size constants are pulled from EdgeNgramTokenFilter instead of EdgeNgramTokenizer LimitTokenCountFilterFactory's maxTokenCount param should be required; this is a pre-existing problem though PatternTokenizerFactory's group param should use the getInt() method with a default of -1. Actually, I was wrong about LimitTokenCountFilterFactory - it already has a test in place to insure reporting of missing required maxTokenCount param. Committing shortly.
          Hide
          Robert Muir added a comment -

          getFloat() - well, only one factory (NumericPayloadTokenFilterFactory) could use it now, but maybe add it for completeness?

          I just remembered. I think the reversewildcardfactory in solr has one of these too.

          Show
          Robert Muir added a comment - getFloat() - well, only one factory (NumericPayloadTokenFilterFactory) could use it now, but maybe add it for completeness? I just remembered. I think the reversewildcardfactory in solr has one of these too.
          Hide
          Robert Muir added a comment -

          I committed this to trunk. Can we be extra careful to use LUCENE-4877: in all commit messages so when its time to backport its easy to find the revisions?

          Thanks in advance for any improvements!

          Show
          Robert Muir added a comment - I committed this to trunk. Can we be extra careful to use LUCENE-4877 : in all commit messages so when its time to backport its easy to find the revisions? Thanks in advance for any improvements!
          Hide
          Robert Muir added a comment -

          I have no preference on how to proceed. I just dont want to download such a large patch, modify the sources and upload it again. especially as TortoiseSVN and other clients depend on order of files in filesystem, the order of created patches is different, too. So its impossible to see any change in comparison to earlier patches.

          As we don't intend to release trunk soon: If all tests pass, can you simply commit as a first step, Robert? +1 on the current patch.

          Ok. I'll commit this one to trunk. We can merge back to stable when we are happy.

          We can open further issues to unfuck ResourceLoaderAware (it should be removed, too and the ResourceLoader should be passed to the ctor, too).

          I don't like this either: but it would require more heavy duty stuff in solr to fix that. I also think it would be good to separate that from the goal of throwing exceptions for bogus arguments (as i fixed that for all analysis factories in this patch despite the ResourceLoader issue: all arguments are parsed in ctor and the inform() only does actual loading).

          The challenge in fixing the ResourceLoader stuff is in SolrResourceLoader. when it loads these "aware" classes it only adds them to a list (it defers this), and then at the end calls inform() on everything in the list. I haven't looked into why this is, but yeah i would definitely prefer if this was a followup issue actually.

          I would like this change (throw exception on invalid arguments) to make it back to the stable branch, so I'll keep this issue open until everyone is happy with it. But in general we should try to split up issues for these factories into steps: its not easy to make these sweeping changes unless you either have lots of time or are ok with introducing bugs.

          Show
          Robert Muir added a comment - I have no preference on how to proceed. I just dont want to download such a large patch, modify the sources and upload it again. especially as TortoiseSVN and other clients depend on order of files in filesystem, the order of created patches is different, too. So its impossible to see any change in comparison to earlier patches. As we don't intend to release trunk soon: If all tests pass, can you simply commit as a first step, Robert? +1 on the current patch. Ok. I'll commit this one to trunk. We can merge back to stable when we are happy. We can open further issues to unfuck ResourceLoaderAware (it should be removed, too and the ResourceLoader should be passed to the ctor, too). I don't like this either: but it would require more heavy duty stuff in solr to fix that. I also think it would be good to separate that from the goal of throwing exceptions for bogus arguments (as i fixed that for all analysis factories in this patch despite the ResourceLoader issue: all arguments are parsed in ctor and the inform() only does actual loading). The challenge in fixing the ResourceLoader stuff is in SolrResourceLoader. when it loads these "aware" classes it only adds them to a list (it defers this), and then at the end calls inform() on everything in the list. I haven't looked into why this is, but yeah i would definitely prefer if this was a followup issue actually. I would like this change (throw exception on invalid arguments) to make it back to the stable branch, so I'll keep this issue open until everyone is happy with it. But in general we should try to split up issues for these factories into steps: its not easy to make these sweeping changes unless you either have lots of time or are ok with introducing bugs.
          Hide
          Uwe Schindler added a comment - - edited

          I have no preference on how to proceed. I just dont want to download such a large patch, modify the sources and upload it again. especially as TortoiseSVN and other clients depend on order of files in filesystem, the order of created patches is different, too. So its impossible to see any change in comparison to earlier patches.

          As we don't intend to release trunk soon: If all tests pass, can you simply commit as a first step, Robert? +1 on the current patch. We can open further issues to unfuck ResourceLoaderAware (it should be removed, too and the ResourceLoader should be passed to the ctor, too).

          Show
          Uwe Schindler added a comment - - edited I have no preference on how to proceed. I just dont want to download such a large patch, modify the sources and upload it again. especially as TortoiseSVN and other clients depend on order of files in filesystem, the order of created patches is different, too. So its impossible to see any change in comparison to earlier patches. As we don't intend to release trunk soon: If all tests pass, can you simply commit as a first step, Robert? +1 on the current patch. We can open further issues to unfuck ResourceLoaderAware (it should be removed, too and the ResourceLoader should be passed to the ctor, too).
          Hide
          Robert Muir added a comment -

          Or you can assign the issue. I just have family coming into town today and won't have any time to do any of the work to help out with this one.

          I thought i knew what was involved with this stuff pretty well but I grossly underestimated the amount of work to even throw the exception

          Show
          Robert Muir added a comment - Or you can assign the issue. I just have family coming into town today and won't have any time to do any of the work to help out with this one. I thought i knew what was involved with this stuff pretty well but I grossly underestimated the amount of work to even throw the exception
          Hide
          Steve Rowe added a comment -

          Should we create a branch. I also have some ideas to fix...

          We can, or maybe just use trunk? I don't think the patch makes these factories any worse.

          +1 to commit Robert's patch to trunk and iterate there.

          Show
          Steve Rowe added a comment - Should we create a branch. I also have some ideas to fix... We can, or maybe just use trunk? I don't think the patch makes these factories any worse. +1 to commit Robert's patch to trunk and iterate there.
          Hide
          Robert Muir added a comment -

          Should we create a branch. I also have some ideas to fix...

          We can, or maybe just use trunk? I don't think the patch makes these factories any worse.

          Show
          Robert Muir added a comment - Should we create a branch. I also have some ideas to fix... We can, or maybe just use trunk? I don't think the patch makes these factories any worse.
          Hide
          Robert Muir added a comment -

          I don't like the 4-arg required param getXXX() methods in AbstractAnalysisFactory - 4th param as false means required??? - maybe these could be converted to getRequiredXXX() ?

          I think AbstractAnalysisFactory could use additional param parsing methods:

          get(args, param [, default]) would be a nice addition for strings, instead of args.remove(), which looks different from all the other getXXX() methods. Maybe also a version that takes a set of acceptable values, as well as a boolean for case insensitivity?
          getEnum(args, param, Enum class [, default] ) - probably case insensitivity could be assumed?
          getChar() should be pulled out of PathHierarchyTokenizerFactory, so that DelimitedPayloadTokenFilterFactory can use it.
          getFloat() - well, only one factory (NumericPayloadTokenFilterFactory) could use it now, but maybe add it for completeness?

          A few nits:

          TestMappingCharFilterFactory's factory could switch to being instantiated using the charFilterFactory() method
          EdgeNgramTokenizerFactory's gram size constants are pulled from EdgeNgramTokenFilter instead of EdgeNgramTokenizer
          LimitTokenCountFilterFactory's maxTokenCount param should be required; this is a pre-existing problem though
          PatternTokenizerFactory's group param should use the getInt() method with a default of -1.

          +1 to all of this!

          Show
          Robert Muir added a comment - I don't like the 4-arg required param getXXX() methods in AbstractAnalysisFactory - 4th param as false means required??? - maybe these could be converted to getRequiredXXX() ? I think AbstractAnalysisFactory could use additional param parsing methods: get(args, param [, default] ) would be a nice addition for strings, instead of args.remove(), which looks different from all the other getXXX() methods. Maybe also a version that takes a set of acceptable values, as well as a boolean for case insensitivity? getEnum(args, param, Enum class [, default] ) - probably case insensitivity could be assumed? getChar() should be pulled out of PathHierarchyTokenizerFactory, so that DelimitedPayloadTokenFilterFactory can use it. getFloat() - well, only one factory (NumericPayloadTokenFilterFactory) could use it now, but maybe add it for completeness? A few nits: TestMappingCharFilterFactory's factory could switch to being instantiated using the charFilterFactory() method EdgeNgramTokenizerFactory's gram size constants are pulled from EdgeNgramTokenFilter instead of EdgeNgramTokenizer LimitTokenCountFilterFactory's maxTokenCount param should be required; this is a pre-existing problem though PatternTokenizerFactory's group param should use the getInt() method with a default of -1. +1 to all of this!
          Hide
          Uwe Schindler added a comment -

          Should we create a branch. I also have some ideas to fix...

          Show
          Uwe Schindler added a comment - Should we create a branch. I also have some ideas to fix...
          Hide
          Steve Rowe added a comment -

          Really nice de-cluttering in the tests.

          I don't like the 4-arg required param getXXX() methods in AbstractAnalysisFactory - 4th param as false means required??? - maybe these could be converted to getRequiredXXX() ?

          I think AbstractAnalysisFactory could use additional param parsing methods:

          • get(args, param [, default]) would be a nice addition for strings, instead of args.remove(), which looks different from all the other getXXX() methods. Maybe also a version that takes a set of acceptable values, as well as a boolean for case insensitivity?
          • getEnum(args, param, Enum class [, default] ) - probably case insensitivity could be assumed?
          • getChar() should be pulled out of PathHierarchyTokenizerFactory, so that DelimitedPayloadTokenFilterFactory can use it.
          • getFloat() - well, only one factory (NumericPayloadTokenFilterFactory) could use it now, but maybe add it for completeness?

          A few nits:

          • TestMappingCharFilterFactory's factory could switch to being instantiated using the charFilterFactory() method
          • EdgeNgramTokenizerFactory's gram size constants are pulled from EdgeNgramTokenFilter instead of EdgeNgramTokenizer
          • LimitTokenCountFilterFactory's maxTokenCount param should be required; this is a pre-existing problem though
          • PatternTokenizerFactory's group param should use the getInt() method with a default of -1.

          I can do the work if you agree with these.

          Show
          Steve Rowe added a comment - Really nice de-cluttering in the tests. I don't like the 4-arg required param getXXX() methods in AbstractAnalysisFactory - 4th param as false means required??? - maybe these could be converted to getRequiredXXX() ? I think AbstractAnalysisFactory could use additional param parsing methods: get(args, param [, default] ) would be a nice addition for strings, instead of args.remove(), which looks different from all the other getXXX() methods. Maybe also a version that takes a set of acceptable values, as well as a boolean for case insensitivity? getEnum(args, param, Enum class [, default] ) - probably case insensitivity could be assumed? getChar() should be pulled out of PathHierarchyTokenizerFactory, so that DelimitedPayloadTokenFilterFactory can use it. getFloat() - well, only one factory (NumericPayloadTokenFilterFactory) could use it now, but maybe add it for completeness? A few nits: TestMappingCharFilterFactory's factory could switch to being instantiated using the charFilterFactory() method EdgeNgramTokenizerFactory's gram size constants are pulled from EdgeNgramTokenFilter instead of EdgeNgramTokenizer LimitTokenCountFilterFactory's maxTokenCount param should be required; this is a pre-existing problem though PatternTokenizerFactory's group param should use the getInt() method with a default of -1. I can do the work if you agree with these.
          Hide
          Robert Muir added a comment -

          whew, made it thru the rest. all tests pass.

          Show
          Robert Muir added a comment - whew, made it thru the rest. all tests pass.
          Hide
          Michael McCandless added a comment -

          +1

          Show
          Michael McCandless added a comment - +1
          Hide
          Robert Muir added a comment -

          updated patch: tests in analysis/common are done. I'll look at the other analysis modules and solr tomorrow.

          Show
          Robert Muir added a comment - updated patch: tests in analysis/common are done. I'll look at the other analysis modules and solr tomorrow.
          Hide
          Robert Muir added a comment -

          I have fixed all the factories, but currently I just started fixing tests and adding tests for bogus parameters to all factories.

          I added some helpers to BaseTokenStreamTestCase and fixed TestArabicFilters.java as a prototype...

          slow moving

          Show
          Robert Muir added a comment - I have fixed all the factories, but currently I just started fixing tests and adding tests for bogus parameters to all factories. I added some helpers to BaseTokenStreamTestCase and fixed TestArabicFilters.java as a prototype... slow moving
          Hide
          Robert Muir added a comment -

          Thanks guys. I've begun the "horrible slave work" part

          Hope to have a patch soon.

          Show
          Robert Muir added a comment - Thanks guys. I've begun the "horrible slave work" part Hope to have a patch soon.
          Hide
          Adrien Grand added a comment -

          +1

          Show
          Adrien Grand added a comment - +1
          Hide
          Steve Rowe added a comment -

          +1, patch looks good.

          Show
          Steve Rowe added a comment - +1, patch looks good.
          Hide
          Robert Muir added a comment -

          here's one way we could improve it. There are probably other alternatives that might be better, too.

          I only fixed one factory as its better to decide this before going thru all of them.

          Show
          Robert Muir added a comment - here's one way we could improve it. There are probably other alternatives that might be better, too. I only fixed one factory as its better to decide this before going thru all of them.

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development