SOLR-3097

Introduce default Japanese stoptags and stopwords to Solr's example configuration

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      SOLR-3056 discusses introducing a default field type text_ja for Japanese in schema.xml. That configuration would be improved by also introducing default stopwords and stoptags configuration for the field type.

      I believe this configuration should be easily available and tunable for Solr users, so I'm proposing that we add the same stopwords and stoptags provided in LUCENE-3745 to Solr's example configuration. The files can live in solr/example/solr/conf as stopwords_ja.txt and stoptags_ja.txt, alongside stopwords_en.txt for English. (Longer term, I think we should reconsider our overall approach to this across all languages, but that's perhaps a separate discussion.)
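
To make the proposal concrete, a text_ja field type wired to the two files might look roughly like the sketch below. This is an illustration under assumptions, not the content of either attached patch: the factory class names are the ones used in later released Solr versions and may differ from the names in use when this issue was filed, and the file paths assume the files sit directly in the conf directory as proposed here.

```xml
<!-- Sketch only: class names and attributes are assumptions based on later Solr releases -->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Morphological tokenizer for Japanese -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <!-- Drops tokens whose part-of-speech tag is listed in stoptags_ja.txt -->
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt"/>
    <!-- Drops tokens listed in stopwords_ja.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt"/>
  </analyzer>
</fieldType>
```

Because both filters read plain text files from the configuration directory, users can tune stopwords and stoptags without touching the schema itself.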

      1. SOLR-3097.patch (1 kB, Robert Muir)
      2. SOLR-3097.patch (19 kB, Christian Moen)

          Activity

          Christian Moen added a comment -

          Patch for trunk and branch_3x attached.

          Robert Muir added a comment -

          (Longer term, I think we should reconsider our overall approach to this across all languages, but that's perhaps a separate discussion.)

          It is a larger issue... In general we should make it easier to keep the two synchronized. Off the top of my head, one idea for a plan is:

          • add 'snowball format' support to Solr's stop filter so it can read all the Lucene stopwords directly
          • add an Ant task to synchronize the Solr example from Lucene's resources
          • (of course) add field types that actually use all these files

          On the other hand, realistically these resources are pretty static (they don't change once added). So for now I don't think it's a huge
          risk that we don't have an auto-sync process... but we need to tackle these problems to easily integrate European languages anyway.

          So I don't think this should block this issue. Let's get Japanese up and going for now.

          Christian Moen added a comment -

          Thanks a lot, Robert.

          Robert Muir added a comment -

          OK, this Ant task was easy enough to write...

          Here's my first stab at it.

          Christian Moen added a comment -

          Thanks, Robert.

          Is your thinking to use the sync-analyzers target to automatically copy resources to the right place as part of package, example, etc., or is it a convenience that makes it easier to keep the files in sync when we check them in separately?

          The sync-analyzers target works fine for the latter purpose, but it needs hookups elsewhere in build.xml if we want to do this automatically. Happy to follow up on the automatic hookups if this is what you'd like to see in the patch.
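
For readers without the patch at hand, a convenience-only target of the kind discussed above could be sketched as follows. This is not the attached patch: the property names, source paths, and source file names are hypothetical placeholders, and only the target name sync-analyzers and the destination file names come from this issue.

```xml
<!-- Hypothetical sketch; ${lucene.ja.resources} and ${solr.example.conf} are assumed
     properties that would have to be defined elsewhere in build.xml -->
<target name="sync-analyzers"
        description="Committer convenience: copy Lucene analysis resources into the Solr example">
  <!-- Source file names (stopwords.txt, stoptags.txt) are assumptions -->
  <copy file="${lucene.ja.resources}/stopwords.txt"
        tofile="${solr.example.conf}/stopwords_ja.txt"
        overwrite="true" verbose="true"/>
  <copy file="${lucene.ja.resources}/stoptags.txt"
        tofile="${solr.example.conf}/stoptags_ja.txt"
        overwrite="true" verbose="true"/>
</target>
```

Since no other target depends on it, a target like this runs only when invoked explicitly (for example with `ant sync-analyzers`), which is what separates the convenience use from an automatic copy during package or example.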

          Robert Muir added a comment -

          I think it should be a convenience? Because the files rarely change...

          And I fear any automated method would end up just overwriting people's work
          if they try to tweak these files.

          Christian Moen added a comment -

          Robert, I agree.

          Would a patch that contains your build.xml changes and the synched stopwords_ja.txt and stoptags_ja.txt files be a suitable next step? Please advise. Many thanks.

          Robert Muir added a comment -

          I think we are ready to move forward here actually.

          The only modification I want to make is to put this stuff in the conf/lang/ directory,
          instead of in conf/ directly.

          I created SOLR-3105 which uses the same scheme across other languages and it seems
          much more organized this way.


            People

            • Assignee: Unassigned
            • Reporter: Christian Moen