Lucene - Core / LUCENE-6177

Add CustomAnalyzer - a builder that creates Analyzers from the factory classes

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I prepared a "generic" Analyzer class, CustomAnalyzer, that makes it easy to build analyzers as in Solr or Elasticsearch. Under the hood it uses the factory classes. The class is designed as a builder:

      Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
        .withTokenizer("standard")
        .addTokenFilter("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
        .build();
      

      You can pass a resource loader (used for stopwords and similar resources). By default it tries to load resources from the context classloader (without any class as a reference, so paths must be absolute; this is the behaviour ClasspathResourceLoader defaults to).

      In addition, you can specify a Lucene match version; by default it would use Version.LATEST (once LUCENE-5900 is completely fixed).
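
To illustrate the idea of a builder that resolves components by name and takes parameters as flat key/value pairs, here is a minimal, Lucene-free sketch. Everything in it is hypothetical and for illustration only: the class AnalyzerSketch, the FilterFactory interface, the in-memory registry (standing in for the SPI lookup of the factory classes), and the comma-separated "words" parameter (the stop filter in the description reads a word file instead). It is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: a builder that looks up filter factories by name. */
final class AnalyzerSketch {

  /** A factory turns a parameter map into a token-transforming function. */
  interface FilterFactory {
    UnaryOperator<List<String>> create(Map<String, String> params);
  }

  /** Name -> factory registry; an assumption standing in for SPI lookup. */
  private static final Map<String, FilterFactory> REGISTRY = new LinkedHashMap<>();

  static {
    REGISTRY.put("lowercase", params -> tokens -> {
      List<String> out = new ArrayList<>();
      for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
      return out;
    });
    REGISTRY.put("stop", params -> {
      // Assumption: "words" is a comma-separated list rather than a file name.
      List<String> stop = Arrays.asList(params.getOrDefault("words", "").split(","));
      return tokens -> {
        List<String> out = new ArrayList<>();
        for (String t : tokens) if (!stop.contains(t)) out.add(t);
        return out;
      };
    });
  }

  /** Converts flat varargs ("key", "value", "key", "value", ...) into a map. */
  static Map<String, String> toMap(String... params) {
    if (params.length % 2 != 0) {
      throw new IllegalArgumentException("Key-value pairs expected, so the number of params must be even.");
    }
    Map<String, String> map = new LinkedHashMap<>();
    for (int i = 0; i < params.length; i += 2) map.put(params[i], params[i + 1]);
    return map;
  }

  static final class Builder {
    private final List<UnaryOperator<List<String>>> filters = new ArrayList<>();

    /** Resolves the factory by name and configures it with key/value params. */
    Builder addTokenFilter(String name, String... params) {
      FilterFactory factory = REGISTRY.get(name);
      if (factory == null) throw new IllegalArgumentException("Unknown filter: " + name);
      filters.add(factory.create(toMap(params)));
      return this;
    }

    /** Returns a function that splits on whitespace, then applies the filters in order. */
    Function<String, List<String>> build() {
      List<UnaryOperator<List<String>>> chain = new ArrayList<>(filters);
      return text -> {
        List<String> tokens = Arrays.asList(text.trim().split("\\s+"));
        for (UnaryOperator<List<String>> f : chain) tokens = f.apply(tokens);
        return tokens;
      };
    }
  }

  public static void main(String[] args) {
    Function<String, List<String>> ana = new Builder()
        .addTokenFilter("lowercase")
        .addTokenFilter("stop", "words", "the,a")
        .build();
    System.out.println(ana.apply("The quick Fox")); // prints [quick, fox]
  }
}
```

The sketch rejects an odd number of parameters up front, so a missing value fails at build time rather than silently dropping a key.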

      1. LUCENE-6177.patch
        33 kB
        Uwe Schindler
      2. LUCENE-6177.patch
        27 kB
        Uwe Schindler
      3. LUCENE-6177.patch
        20 kB
        Uwe Schindler
      4. LUCENE-6177.patch
        11 kB
        Uwe Schindler

          Activity

          anshumg Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          jira-bot ASF subversion and git services added a comment -

          Commit 1651903 from Uwe Schindler in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1651903 ]

          Merged revision(s) 1651901 from lucene/dev/trunk:
          LUCENE-6177: fix typo

          jira-bot ASF subversion and git services added a comment -

          Commit 1651902 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1651902 ]

          Merged revision(s) 1651901 from lucene/dev/trunk:
          LUCENE-6177: fix typo

          jira-bot ASF subversion and git services added a comment -

          Commit 1651901 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1651901 ]

          LUCENE-6177: fix typo

          thetaphi Uwe Schindler added a comment -

          I committed this now to get it into the coming Lucene 5.0. This is a really new "feature" so it should get its major version. It is also quite "separate", so there is no risk.

          In the future we should use this in Solr (replace TokenizerChain / SolrAnalyzer class). We may also define our default Analyzers throughout the analysis-common package with this class. I will open separate issues for that.

          jira-bot ASF subversion and git services added a comment -

          Commit 1651688 from Uwe Schindler in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1651688 ]

          Merged revision(s) 1651681 from lucene/dev/trunk:
          LUCENE-6177: Add CustomAnalyzer that allows to configure analyzers like you do in Solr's index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories.

          jira-bot ASF subversion and git services added a comment -

          Commit 1651687 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1651687 ]

          Merged revision(s) 1651681 from lucene/dev/trunk:
          LUCENE-6177: Add CustomAnalyzer that allows to configure analyzers like you do in Solr's index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories.

          jira-bot ASF subversion and git services added a comment -

          Commit 1651681 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1651681 ]

          LUCENE-6177: Add CustomAnalyzer that allows to configure analyzers like you do in Solr's index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories.

          thetaphi Uwe Schindler added a comment -

          Patch with a small bugfix (found by new test) and many documentation improvements. I also linked this analyzer in the documentation about Lucene Analysis.

          I will commit this now to 5.0, 5.1 and Trunk.

          thetaphi Uwe Schindler added a comment -

          New patch:

          • allows setting the position increment gap
          • allows setting the offset gap
          • allows getting the config back as lists

          rcmuir Robert Muir added a comment -

          +1 to the patch!

          thetaphi Uwe Schindler added a comment -

          Here is a patch with tests. I will look into Solr's TokenizerChain and see how this could be replaced by the generic approach (it is just code duplication).

          But this is something for later. I would like to just provide this with Lucene 5.0, because quite a lot of people on java-user complain about how hard it is to build an analyzer.

          Robert Muir also suggested that we might implement all current analyzers using this class, because it's much easier to read than createComponents() methods.

          rcmuir Robert Muir added a comment -

          +1 Uwe, looks nice.

          thetaphi Uwe Schindler added a comment -

          First patch.

          I still have to add tests for it. This patch should just show what it looks like. It may still contain bugs; it was just quickly hacked together.


            People

            • Assignee:
              thetaphi Uwe Schindler
              Reporter:
              thetaphi Uwe Schindler