Solr
  1. Solr
  2. SOLR-3272

Solr filter factory for MorfologikFilter

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      I didn't find MorfologikFilter factory in Solr, so here is a simple. Maybe someone will have make use of it

      1. SOLR-3727-new.patch
        5 kB
        Rafał Kuć
      2. SOLR-3272-with-javadoc-example-usage.patch
        5 kB
        Rafał Kuć
      3. SOLR-3272-toupper-correction.patch
        5 kB
        Rafał Kuć
      4. SOLR-3272.patch
        4 kB
        Rafał Kuć
      5. SOLR-3272.patch
        10 kB
        Dawid Weiss
      6. SOLR-3272.patch
        12 kB
        Dawid Weiss

        Activity

        Hide
        Rafał Kuć added a comment -

        Patch with MorfologikFilterFactory and test added.

        Show
        Rafał Kuć added a comment - Patch with MorfologikFilterFactory and test added.
        Hide
        Dawid Weiss added a comment - - edited

        Hi Rafał. Could you modify this patch to include support for the three dictionaries (combined, morfeusz and morfologik)? This would be more flexible (and the combined dictionary is nearly twice larger than morfologik itself so it's worth it).

        return new MorfologikFilter(ts, DICTIONARY.MORFOLOGIK, luceneMatchVersion);
        

        Also, an example of use in the JavaDoc would be nice (see BeiderMorseFilterFactory for example). The test should be using DEFAULT_VERSION not the fixed LUCENE_40. Thanks!

        Show
        Dawid Weiss added a comment - - edited Hi Rafał. Could you modify this patch to include support for the three dictionaries (combined, morfeusz and morfologik)? This would be more flexible (and the combined dictionary is nearly twice larger than morfologik itself so it's worth it). return new MorfologikFilter(ts, DICTIONARY.MORFOLOGIK, luceneMatchVersion); Also, an example of use in the JavaDoc would be nice (see BeiderMorseFilterFactory for example). The test should be using DEFAULT_VERSION not the fixed LUCENE_40. Thanks!
        Hide
        Rafał Kuć added a comment -

        Sure Dawid, no problem. I'll provide a patch later today.

        Show
        Rafał Kuć added a comment - Sure Dawid, no problem. I'll provide a patch later today.
        Hide
        Dawid Weiss added a comment -

        Thanks. Sorry about the name confusion btw. Don't know where I took Michał from

        Show
        Dawid Weiss added a comment - Thanks. Sorry about the name confusion btw. Don't know where I took Michał from
        Hide
        Rafał Kuć added a comment -

        Dawid, I've uploaded new patch. One thing I noted is that the ant build throws an error because it can't find Morfologik libraries.

        I'll patch the build files tomorrow.

        Show
        Rafał Kuć added a comment - Dawid, I've uploaded new patch. One thing I noted is that the ant build throws an error because it can't find Morfologik libraries. I'll patch the build files tomorrow.
        Hide
        Dawid Weiss added a comment -

        Thanks Rafał.

        Show
        Dawid Weiss added a comment - Thanks Rafał.
        Hide
        Rafał Kuć added a comment -

        Dawid, I've looked at the build files and how libs are handled. My proposition is to copy the jar's from the modules/morfologik/lib to solr/contrib/analysis-extras/lib similar to the way the ICU lib is handled. What do you think about that ?

        I've also updated the patch, to have javadocs example usage.

        Show
        Rafał Kuć added a comment - Dawid, I've looked at the build files and how libs are handled. My proposition is to copy the jar's from the modules/morfologik/lib to solr/contrib/analysis-extras/lib similar to the way the ICU lib is handled. What do you think about that ? I've also updated the patch, to have javadocs example usage.
        Hide
        Dawid Weiss added a comment -

        I actually don't know what the policy is – I asked on the dev list, we'll see what solr folks prefer.

        Show
        Dawid Weiss added a comment - I actually don't know what the policy is – I asked on the dev list, we'll see what solr folks prefer.
        Hide
        Uwe Schindler added a comment -

        Hi, the patch has one problem (unrelated to the JAR file problem): It uses toUpperCase() without locale, so uppercasing with e.g. Turkish locale will fail horrible. This type of code must use toUpperCase(Locale.ENGLISH).

        It the JAR file already included with analysis module? I don't really understand the problem with the JAR file.

        Show
        Uwe Schindler added a comment - Hi, the patch has one problem (unrelated to the JAR file problem): It uses toUpperCase() without locale, so uppercasing with e.g. Turkish locale will fail horrible. This type of code must use toUpperCase(Locale.ENGLISH). It the JAR file already included with analysis module? I don't really understand the problem with the JAR file.
        Hide
        Uwe Schindler added a comment -

        Ah I see, the dictionaries are already there. In that case we would not duplicate them inside the Solr tree. It should stay in analysis module. We can move them during build and war-file creation to solr, like we do with other lucene jars.

        Show
        Uwe Schindler added a comment - Ah I see, the dictionaries are already there. In that case we would not duplicate them inside the Solr tree. It should stay in analysis module. We can move them during build and war-file creation to solr, like we do with other lucene jars.
        Hide
        Rafał Kuć added a comment -

        I'll update the patch and the build files, thanks for comments.

        Show
        Rafał Kuć added a comment - I'll update the patch and the build files, thanks for comments.
        Hide
        Dawid Weiss added a comment -

        Thanks Uwe. Btw. should we apply it to 3.x as well? This seems like a harmless patch and it'd be a nice-to-have feature.

        Show
        Dawid Weiss added a comment - Thanks Uwe. Btw. should we apply it to 3.x as well? This seems like a harmless patch and it'd be a nice-to-have feature.
        Hide
        Robert Muir added a comment -

        There is no morfologik in 3.x. I asked about that before but you seemed a tad offended
        at the prospect of a java5 version

        Show
        Robert Muir added a comment - There is no morfologik in 3.x. I asked about that before but you seemed a tad offended at the prospect of a java5 version
        Hide
        Dawid Weiss added a comment -

        Damn. [Blushing].

        I could prepare a 1.5 compatible version with retroweaver and integrate it in. I guess now I don't have excuses, do I... Do we want to push it in at the last minute though?

        Show
        Dawid Weiss added a comment - Damn. [Blushing] . I could prepare a 1.5 compatible version with retroweaver and integrate it in. I guess now I don't have excuses, do I... Do we want to push it in at the last minute though?
        Hide
        Robert Muir added a comment -

        I have no objection, as long as all new files:

        • have package.html
        • have some javadocs for each class
        • have apache license header

        and assuming its in by tomorrow (when i will freeze).

        Stuff like new analyzers that weren't available before aren't really risky,
        because they won't break any existing functionality.

        Show
        Robert Muir added a comment - I have no objection, as long as all new files: have package.html have some javadocs for each class have apache license header and assuming its in by tomorrow (when i will freeze). Stuff like new analyzers that weren't available before aren't really risky, because they won't break any existing functionality.
        Hide
        Rafał Kuć added a comment -

        Corrected toUpper to use Locale.ENGLISH

        Show
        Rafał Kuć added a comment - Corrected toUpper to use Locale.ENGLISH
        Hide
        Dawid Weiss added a comment -

        I attach a patch that introduces subtle changes (enum.valueOf never returns null, it throws an exception on invalid values).

        There is also my take at build integration although I have no idea if I did it right – where is the code that copies these JARs to solr? The code compiles for me but the war file doesn't have any contribs, for example (they're optional?).

        Show
        Dawid Weiss added a comment - I attach a patch that introduces subtle changes (enum.valueOf never returns null, it throws an exception on invalid values). There is also my take at build integration although I have no idea if I did it right – where is the code that copies these JARs to solr? The code compiles for me but the war file doesn't have any contribs, for example (they're optional?).
        Hide
        Dawid Weiss added a comment -

        Can I ask somebody to look at the build file changes (and determine if morfologik JARs should be copied and where). Otherwise this is ready to be committed I think.

        After some deliberation I won't rush to make Morfologik part of 3.x – last minute features are the worst.

        Show
        Dawid Weiss added a comment - Can I ask somebody to look at the build file changes (and determine if morfologik JARs should be copied and where). Otherwise this is ready to be committed I think. After some deliberation I won't rush to make Morfologik part of 3.x – last minute features are the worst.
        Hide
        Steve Rowe added a comment - - edited

        where is the code that copies these JARs to solr?

        The dist target in solr/webapp/build.xml depends on lucene-jars-to-solr in solr/common-build.xml, which makes sure that lucene .jar dependencies are built (via prep-lucene-jars), and then copies them to solr/build/lucene-libs/, the contents of which are then packaged into the .war. Dawid, I think this is where you want to make changes.

        More packaging details:

        The dist target in solr/webapp/build.xml also runs the contribs-add-to-war target from solr/common-build.xml, which invokes add-to-war in each Solr contrib that has a src/webapp/ directory (currently only DIH) to copy the contents of src/webapp/ to solr/build/web/, the contents of which are then packaged into the .war.

        Lastly, the create-package target in solr/build.xml depends on the dist target, which depends on the dist-contrib target, which invokes dist in each Solr contrib. Solr contribs that want to include lucene .jar dependencies in the Solr distribution (but not the .war) have their dist targets populate solr/build/contrib/<contrib-name>/lucene-libs/ with those .jars. The create-package target then invokes add-lucene-libs-to-package in each Solr contrib, which copies the contents of solr/build/contrib/<contrib-name>/lucene-libs/ to solr/build/contrib-lucene-libs-to-package/contrib/<contrib-name>/. Finally, the distributions (tarball/zip) include the contents of solr/build/contrib-lucene-libs-to-package/.

        Wow, that last part is way more complex than it needs to be. (And it's totally my fault )

        Before Robert and I rewrote the Solr build, the top-level Solr build file was huge. One of the guiding principles I used was "keep build configuration local". Now individual modules handle the details that only concern them. Coordination among modules, as seen in the above packaging description, remains a challenge...

        Show
        Steve Rowe added a comment - - edited where is the code that copies these JARs to solr? The dist target in solr/webapp/build.xml depends on lucene-jars-to-solr in solr/common-build.xml , which makes sure that lucene .jar dependencies are built (via prep-lucene-jars ), and then copies them to solr/build/lucene-libs/ , the contents of which are then packaged into the .war. Dawid, I think this is where you want to make changes. More packaging details: The dist target in solr/webapp/build.xml also runs the contribs-add-to-war target from solr/common-build.xml , which invokes add-to-war in each Solr contrib that has a src/webapp/ directory (currently only DIH) to copy the contents of src/webapp/ to solr/build/web/ , the contents of which are then packaged into the .war. Lastly, the create-package target in solr/build.xml depends on the dist target, which depends on the dist-contrib target, which invokes dist in each Solr contrib. Solr contribs that want to include lucene .jar dependencies in the Solr distribution (but not the .war) have their dist targets populate solr/build/contrib/<contrib-name>/lucene-libs/ with those .jars. The create-package target then invokes add-lucene-libs-to-package in each Solr contrib, which copies the contents of solr/build/contrib/<contrib-name>/lucene-libs/ to solr/build/contrib-lucene-libs-to-package/contrib/<contrib-name>/ . Finally, the distributions (tarball/zip) include the contents of solr/build/contrib-lucene-libs-to-package/ . Wow, that last part is way more complex than it needs to be. (And it's totally my fault ) Before Robert and I rewrote the Solr build, the top-level Solr build file was huge . One of the guiding principles I used was "keep build configuration local". Now individual modules handle the details that only concern them. Coordination among modules, as seen in the above packaging description, remains a challenge...
        Hide
        Dawid Weiss added a comment -

        Final patch (applied).

        Show
        Dawid Weiss added a comment - Final patch (applied).
        Hide
        Dawid Weiss added a comment -

        Applied the patch. Thanks for clarifications concerning build deps, Steven. I've built and dist and the JARs make it under WEB-INF/lib so all seems good.

        Show
        Dawid Weiss added a comment - Applied the patch. Thanks for clarifications concerning build deps, Steven. I've built and dist and the JARs make it under WEB-INF/lib so all seems good.
        Hide
        Dawid Weiss added a comment -

        This is in trunk now, thanks Rafał.

        Show
        Dawid Weiss added a comment - This is in trunk now, thanks Rafał.

          People

          • Assignee:
            Dawid Weiss
            Reporter:
            Rafał Kuć
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development