Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7585

Interface for common parameters used across analysis factories

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 6.3
    • Fix Version/s: 7.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available
    • Flags:
      Patch

      Description

      Certain parameters (String constants) are same/common for multiple analysis factories. Some examples are ignoreCase, dictionary, and preserveOriginal. These string constants are handled inconsistently in different factories. This is an effort to define most common constants in (CommonAnalysisFactoryParams) interface and reuse them.

      1. LUCENE-7585.patch
        83 kB
        Ahmet Arslan
      2. LUCENE-7585.patch
        80 kB
        Ahmet Arslan
      3. LUCENE-7585.patch
        61 kB
        Ahmet Arslan
      4. LUCENE-7585.patch
        58 kB
        Ahmet Arslan
      5. LUCENE-7585.patch
        56 kB
        Ahmet Arslan
      6. LUCENE-7585.patch
        55 kB
        Ahmet Arslan
      7. LUCENE-7585.patch
        48 kB
        Ahmet Arslan
      8. LUCENE-7585.patch
        44 kB
        Ahmet Arslan

        Issue Links

          Activity

          Hide
          dsmiley David Smiley added a comment -

          +1 LGTM. The only thing I'd change is "LUCENE_MATCH_VERSION_PARAM" to "LUCENE_MATCH_VERSION".

          If there are no further comments then I'll commit this later this weekend.

          Show
          dsmiley David Smiley added a comment - +1 LGTM. The only thing I'd change is "LUCENE_MATCH_VERSION_PARAM" to "LUCENE_MATCH_VERSION". If there are no further comments then I'll commit this later this weekend.
          Hide
          steve_rowe Steve Rowe added a comment -

          The patch appears to be against a non-existent version of CommonAnalysisFactoryParams.java, so doesn't apply properly. Manual repair worked for me (saving the patch as the target file, trimming to include just the target file contents, removing - lines, and deleting + prefixes from lines).

          It'd be nice to alphabetize the constants.

          Otherwise, +1.

          Show
          steve_rowe Steve Rowe added a comment - The patch appears to be against a non-existent version of CommonAnalysisFactoryParams.java , so doesn't apply properly. Manual repair worked for me (saving the patch as the target file, trimming to include just the target file contents, removing - lines, and deleting + prefixes from lines). It'd be nice to alphabetize the constants. Otherwise, +1.
          Hide
          iorixxx Ahmet Arslan added a comment -

          Properly created patch that includes proposed changes (alphabetisation and lucene_match_version). Ant documentation-lint complains about factories of icu. Any pointer how to fix it?

          Show
          iorixxx Ahmet Arslan added a comment - Properly created patch that includes proposed changes (alphabetisation and lucene_match_version). Ant documentation-lint complains about factories of icu. Any pointer how to fix it?
          Hide
          iorixxx Ahmet Arslan added a comment -

          Thank you for looking into this. Initially, I was planning to move all existing parameters to a common interface.
          I figured that the interface will grow very large since certain factories have many specific parameters.
          I moved the most common parameters to the interface. However, there still remains a lot in the codebase.
          For example, ngram package has "minGramSize" and "maxGramSize" in common. Phonetic module has "maxCodeLength" and "inject."

          What could be the preferred course of action here?

          • Handle packages and modules locally? If yes how?
          • Move all parameters to the interface unconditionally.
          • Devise an algorithm: Move if a parameter is shared by at least two package or module.
          • ?
          Show
          iorixxx Ahmet Arslan added a comment - Thank you for looking into this. Initially, I was planning to move all existing parameters to a common interface. I figured that the interface will grow very large since certain factories have many specific parameters. I moved the most common parameters to the interface. However, there still remains a lot in the codebase. For example, ngram package has "minGramSize" and "maxGramSize" in common. Phonetic module has "maxCodeLength" and "inject." What could be the preferred course of action here? Handle packages and modules locally? If yes how? Move all parameters to the interface unconditionally. Devise an algorithm: Move if a parameter is shared by at least two package or module. ?
          Hide
          iorixxx Ahmet Arslan added a comment -

          Here is an excerpt from documentation-lint

            [exec] build/docs/analyzers-icu/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerFactory.html
               [exec]   missing Fields: CONSUME_ALL_TOKENS
               [exec]   missing Fields: DELIMITER
               [exec]   missing Fields: DICTIONARY
               [exec]   missing Fields: ENCODER
               [exec]   missing Fields: FORMAT
               [exec]   missing Fields: IGNORE_CASE
               [exec]   missing Fields: LUCENE_MATCH_VERSION
               [exec]   missing Fields: MAX
               [exec]   missing Fields: MAX_TOKEN_LENGTH
               [exec]   missing Fields: MIN
               [exec]   missing Fields: PATTERN
               [exec]   missing Fields: PRESERVE_ORIGINAL
               [exec]   missing Fields: PROTECTED
               [exec]   missing Fields: TYPES
               [exec]   missing Fields: WORDS
               [exec] 
               [exec] Missing javadocs were found!
          
          Show
          iorixxx Ahmet Arslan added a comment - Here is an excerpt from documentation-lint [exec] build/docs/analyzers-icu/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerFactory.html [exec] missing Fields: CONSUME_ALL_TOKENS [exec] missing Fields: DELIMITER [exec] missing Fields: DICTIONARY [exec] missing Fields: ENCODER [exec] missing Fields: FORMAT [exec] missing Fields: IGNORE_CASE [exec] missing Fields: LUCENE_MATCH_VERSION [exec] missing Fields: MAX [exec] missing Fields: MAX_TOKEN_LENGTH [exec] missing Fields: MIN [exec] missing Fields: PATTERN [exec] missing Fields: PRESERVE_ORIGINAL [exec] missing Fields: PROTECTED [exec] missing Fields: TYPES [exec] missing Fields: WORDS [exec] [exec] Missing javadocs were found!
          Hide
          dsmiley David Smiley added a comment -

          RE documentation-lint: It appears they need javadocs then.

          (quoting from your description)

          These string constants are handled inconsistently in different factories

          Can you give an example of how the same parameter is handled differently? AFAICT This patch doesn't address that. Such inconsistencies, if they exist, would be more important to resolve than moving Strings constants to an interface IMO.

          I think limiting the scope of this patch to common parameters is well enough. In doing so, they can be documented in one place (please do that). I'm not sure what value/point there is in doing anything more with other parameters (used by one class), or trying to overcome any inter-package complications. So minGramSize etc. could just be defined in the same common one. At least that's my opinion; it's kind of a bike-shed issue of taste.

          Show
          dsmiley David Smiley added a comment - RE documentation-lint: It appears they need javadocs then. (quoting from your description) These string constants are handled inconsistently in different factories Can you give an example of how the same parameter is handled differently? AFAICT This patch doesn't address that. Such inconsistencies, if they exist, would be more important to resolve than moving Strings constants to an interface IMO. I think limiting the scope of this patch to common parameters is well enough. In doing so, they can be documented in one place (please do that). I'm not sure what value/point there is in doing anything more with other parameters (used by one class), or trying to overcome any inter-package complications. So minGramSize etc. could just be defined in the same common one. At least that's my opinion; it's kind of a bike-shed issue of taste.
          Hide
          iorixxx Ahmet Arslan added a comment -

          By saying inconsistency, I mean the strategy to retrieve those parameters from the arg map. Some use inline string constant e.g. getBoolean(args, "reverse"); others define private or public static final String for the key.

          Show
          iorixxx Ahmet Arslan added a comment - By saying inconsistency, I mean the strategy to retrieve those parameters from the arg map. Some use inline string constant e.g. getBoolean(args, "reverse"); others define private or public static final String for the key.
          Hide
          iorixxx Ahmet Arslan added a comment -

          a few more refactoring including the overlooked code point filter factory.

          Show
          iorixxx Ahmet Arslan added a comment - a few more refactoring including the overlooked code point filter factory.
          Hide
          iorixxx Ahmet Arslan added a comment -

          I tried adding javadocs to fields in the interface, but it did not solve the missing javadocs problem.
          documentation-lint complains/fails for the lucene/analysis/modules, which are explicitly defined with the level of method in lucene/build.xml

              <check-missing-javadocs dir="build/docs/analyzers-icu" level="method"/>
              <check-missing-javadocs dir="build/docs/analyzers-morfologik" level="method"/>
              <check-missing-javadocs dir="build/docs/analyzers-phonetic" level="method"/>
              <check-missing-javadocs dir="build/docs/analyzers-stempel" level="method"/>
              <check-missing-javadocs dir="build/docs/suggest" level="method"/>
          

          I figured that this method=(level|class|none) thing is about checkJavaDocs.py.
          Any pointer how to document interface fields so that level="method" passes in checkJavaDocs.py?

          Or, can we remove above xml fragment from build.xml?

          Show
          iorixxx Ahmet Arslan added a comment - I tried adding javadocs to fields in the interface, but it did not solve the missing javadocs problem. documentation-lint complains/fails for the lucene/analysis/modules, which are explicitly defined with the level of method in lucene/build.xml <check-missing-javadocs dir= "build/docs/analyzers-icu" level= "method" /> <check-missing-javadocs dir= "build/docs/analyzers-morfologik" level= "method" /> <check-missing-javadocs dir= "build/docs/analyzers-phonetic" level= "method" /> <check-missing-javadocs dir= "build/docs/analyzers-stempel" level= "method" /> <check-missing-javadocs dir= "build/docs/suggest" level= "method" /> I figured that this method=(level|class|none) thing is about checkJavaDocs.py . Any pointer how to document interface fields so that level="method" passes in checkJavaDocs.py? Or, can we remove above xml fragment from build.xml?
          Hide
          iorixxx Ahmet Arslan added a comment -

          Patch that adds javadocs. ant documentation-lint still fails for some reason that I cannot figure out.

          Show
          iorixxx Ahmet Arslan added a comment - Patch that adds javadocs. ant documentation-lint still fails for some reason that I cannot figure out.
          Hide
          iorixxx Ahmet Arslan added a comment -

          bring the patch to master.

          Show
          iorixxx Ahmet Arslan added a comment - bring the patch to master.
          Hide
          iorixxx Ahmet Arslan added a comment -

          Finally ant precommit passes with this patch. It checks missing javadocs using level=package for icu, morfologik, phonetic, and suggest.

          Show
          iorixxx Ahmet Arslan added a comment - Finally ant precommit passes with this patch. It checks missing javadocs using level=package for icu, morfologik, phonetic, and suggest.
          Hide
          dsmiley David Smiley added a comment -

          I'm looking over this patch again.

          • CONSUME_ALL_TOKENS is only used by LimitTokenCountFilterFactory thus shouldn't be in CommonAnalysisFactoryParams?
          • Can you please try and say a little something about a parameter in the javadoc other than simply repeating the name (which isn't helpful). For example "dictionary" refers to a file – that's helpful. Perhaps not for min/max or any other intuitive param.
          • I see you chose to use these constants in the tests. I recall Yonik Seeley once recommending that Solr tests should not refer to Solr parameter constants because doing so may make it easier to introduce a back-compat break. That makes sense to me and the concept applies here to.

          BTW nice attention to detail in some classes.... I noticed you corrected some javadocs and added some value set constraints.

          Show
          dsmiley David Smiley added a comment - I'm looking over this patch again. CONSUME_ALL_TOKENS is only used by LimitTokenCountFilterFactory thus shouldn't be in CommonAnalysisFactoryParams? Can you please try and say a little something about a parameter in the javadoc other than simply repeating the name (which isn't helpful). For example "dictionary" refers to a file – that's helpful. Perhaps not for min/max or any other intuitive param. I see you chose to use these constants in the tests. I recall Yonik Seeley once recommending that Solr tests should not refer to Solr parameter constants because doing so may make it easier to introduce a back-compat break. That makes sense to me and the concept applies here to. BTW nice attention to detail in some classes.... I noticed you corrected some javadocs and added some value set constraints.
          Hide
          iorixxx Ahmet Arslan added a comment -

          Sorry for the huge delay. This patch addresses the issues raised by David.

          • consumeAllTokens is used by LimitTokenOffset and LimitTokenPosition too.
          • applies Yonik's concept
          • improved javadoc. Some arguments are difficult since they have different meanings in different components.
          • covers a few more overlooked analysis factories
          • spotted a copy-paste mistake

          Any feedback is appreciated.

          Show
          iorixxx Ahmet Arslan added a comment - Sorry for the huge delay. This patch addresses the issues raised by David. consumeAllTokens is used by LimitTokenOffset and LimitTokenPosition too. applies Yonik's concept improved javadoc. Some arguments are difficult since they have different meanings in different components. covers a few more overlooked analysis factories spotted a copy-paste mistake Any feedback is appreciated.
          Hide
          iorixxx Ahmet Arslan added a comment -

          Fix TestStopFilterFactory and TestSuggestStopFilterFactory failure

          Show
          iorixxx Ahmet Arslan added a comment - Fix TestStopFilterFactory and TestSuggestStopFilterFactory failure

            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              iorixxx Ahmet Arslan
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:

                Development