Lucene - Core / LUCENE-6177

Add CustomAnalyzer - a builder that creates Analyzers from the factory classes

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I prepared a generic Analyzer class, CustomAnalyzer, that makes it easy to build analyzers the way you would in Solr or Elasticsearch. Under the hood it uses the factory classes. The class is designed as a builder:

      Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
        .withTokenizer("standard")
        .addTokenFilter("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
        .build();
      

      It is possible to supply the resource loader (used for stopwords and similar resources). By default it tries to load resources from the context classloader, without any class as reference, so paths must be absolute; this is the behaviour ClasspathResourceLoader defaults to.
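To illustrate what "absolute" means for classloader lookups, here is a minimal sketch using only the JDK. It is not the ClasspathResourceLoader implementation. The class name `ContextResources` and the resource path are hypothetical. A lookup through a `ClassLoader` (as opposed to `Class.getResourceAsStream`) is always resolved from the classpath root, so the full package path has to be spelled out and no leading slash is used.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, for illustration only: opens a resource through the
// thread's context classloader, with no reference class involved.
final class ContextResources {

  static InputStream open(String resource) throws IOException {
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    if (loader == null) {
      // Some environments leave the context classloader unset.
      loader = ClassLoader.getSystemClassLoader();
    }
    // Resolved from the classpath root: pass "org/example/stopwords.txt",
    // not "stopwords.txt" and not "/org/example/stopwords.txt".
    InputStream in = loader.getResourceAsStream(resource);
    if (in == null) {
      throw new IOException("Resource not found: " + resource);
    }
    return in;
  }

  public static void main(String[] args) {
    // The path below is an assumption; it only resolves if such a file is on the classpath.
    try (InputStream in = open("org/example/stopwords.txt")) {
      System.out.println("found resource");
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With a reference class, a relative name would be resolved against that class's package; without one, only the full path from the classpath root can work.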

      In addition, you can specify a Lucene match version; by default it would use Version.LATEST (once LUCENE-5900 is completely fixed).
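The builder idea described above (components looked up by a short name, parameters passed as alternating key/value strings) can be sketched with the JDK alone. This is a simplified illustration, not the code from the patch: `MiniChainBuilder`, `FilterFactory` and the in-memory registry are hypothetical, tokens are plain strings rather than a TokenStream, and the `words` parameter takes a comma-separated list instead of a file name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical, simplified stand-in for a factory-driven analyzer builder.
final class MiniChainBuilder {

  // A factory turns a parameter map into a filter over a token list.
  interface FilterFactory {
    UnaryOperator<List<String>> create(Map<String, String> args);
  }

  // Registry keyed by short name, standing in for a lookup by SPI name.
  private static final Map<String, FilterFactory> FILTERS = new HashMap<>();

  static {
    FILTERS.put("lowercase", args -> tokens -> {
      List<String> out = new ArrayList<>();
      for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
      return out;
    });
    FILTERS.put("stop", args -> {
      // Assumption: stop words are given inline, comma-separated.
      Set<String> stopWords = new HashSet<>(Arrays.asList(args.getOrDefault("words", "").split(",")));
      return tokens -> {
        List<String> out = new ArrayList<>();
        for (String t : tokens) if (!stopWords.contains(t)) out.add(t);
        return out;
      };
    });
  }

  private final List<UnaryOperator<List<String>>> chain = new ArrayList<>();

  // Converts "key1", "value1", "key2", "value2", ... into an ordered map.
  static Map<String, String> toParamMap(String... keysAndValues) {
    if (keysAndValues.length % 2 != 0) {
      throw new IllegalArgumentException("Parameters must be given as key/value pairs");
    }
    Map<String, String> map = new LinkedHashMap<>();
    for (int i = 0; i < keysAndValues.length; i += 2) {
      map.put(keysAndValues[i], keysAndValues[i + 1]);
    }
    return map;
  }

  MiniChainBuilder addTokenFilter(String name, String... params) {
    FilterFactory factory = FILTERS.get(name);
    if (factory == null) {
      throw new IllegalArgumentException("Unknown filter name: " + name);
    }
    chain.add(factory.create(toParamMap(params)));
    return this;
  }

  // Returns a function that applies all configured filters in order.
  UnaryOperator<List<String>> build() {
    List<UnaryOperator<List<String>>> steps = new ArrayList<>(chain);
    return tokens -> {
      List<String> current = tokens;
      for (UnaryOperator<List<String>> step : steps) current = step.apply(current);
      return current;
    };
  }

  public static void main(String[] args) {
    UnaryOperator<List<String>> ana = new MiniChainBuilder()
        .addTokenFilter("lowercase")
        .addTokenFilter("stop", "words", "the,a")
        .build();
    System.out.println(ana.apply(Arrays.asList("The", "Quick", "a", "Fox"))); // prints [quick, fox]
  }
}
```

Because every component is addressed by name and configured through strings, the same chain could just as well be read from a configuration file, which is what makes this style convenient compared to hand-written createComponents() methods.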

      1. LUCENE-6177.patch
        33 kB
        Uwe Schindler
      2. LUCENE-6177.patch
        27 kB
        Uwe Schindler
      3. LUCENE-6177.patch
        20 kB
        Uwe Schindler
      4. LUCENE-6177.patch
        11 kB
        Uwe Schindler

          Activity

          Uwe Schindler added a comment -

          First patch.

          I still have to add tests for it. This patch is only meant to show what it looks like. It may still contain bugs, as it was quickly hacked together.

          Robert Muir added a comment -

          +1 Uwe, looks nice.

          Uwe Schindler added a comment -

          Here is a patch with tests. I will look into Solr's TokenizerChain and see how it could be replaced by this generic approach (it is just code duplication).

          But that is something for later. I would like to provide this with Lucene 5.0, because quite a lot of people on java-user ask how to build an analyzer.

          Robert Muir also suggested that we might implement all current analyzers using this class, because it's much easier to read than createComponents() methods.

          Robert Muir added a comment -

          +1 to the patch!

          Uwe Schindler added a comment -

          New patch:

          • allows setting positionIncrement
          • allows setting offsetIncrement
          • allows getting the config back as lists

          Uwe Schindler added a comment -

          Patch with a small bugfix (found by new test) and many documentation improvements. I also linked this analyzer in the documentation about Lucene Analysis.

          I will commit this now to 5.0, 5.1 and Trunk.

          ASF subversion and git services added a comment -

          Commit 1651681 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1651681 ]

          LUCENE-6177: Add CustomAnalyzer that allows to configure analyzers like you do in Solr's index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories.

          ASF subversion and git services added a comment -

          Commit 1651687 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1651687 ]

          Merged revision(s) 1651681 from lucene/dev/trunk:
          LUCENE-6177: Add CustomAnalyzer that allows to configure analyzers like you do in Solr's index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories.

          ASF subversion and git services added a comment -

          Commit 1651688 from Uwe Schindler in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1651688 ]

          Merged revision(s) 1651681 from lucene/dev/trunk:
          LUCENE-6177: Add CustomAnalyzer that allows to configure analyzers like you do in Solr's index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories.

          Uwe Schindler added a comment -

          I committed this now to get it into the upcoming Lucene 5.0. This is a genuinely new feature, so it should ship with a major version. It is also quite self-contained, so there is no risk.

          In the future we should use this in Solr (replacing the TokenizerChain / SolrAnalyzer class). We may also define our default Analyzers throughout the analysis-common package with this class. I will open separate issues for that.

          ASF subversion and git services added a comment -

          Commit 1651901 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1651901 ]

          LUCENE-6177: fix typo

          ASF subversion and git services added a comment -

          Commit 1651902 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1651902 ]

          Merged revision(s) 1651901 from lucene/dev/trunk:
          LUCENE-6177: fix typo

          ASF subversion and git services added a comment -

          Commit 1651903 from Uwe Schindler in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1651903 ]

          Merged revision(s) 1651901 from lucene/dev/trunk:
          LUCENE-6177: fix typo

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0
              Watchers:
              3
