Solr / SOLR-7509

Solr Multilingual Indexing with one field

    Details

    • Type: Wish
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 4.2.1
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels: None
    • Environment: Red Hat Linux, 4 cores, 12 GB

Description

Our current production index is 1.5 TB across 3 shards. We currently use the following field type:

      <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">

      <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
      </analyzer>
      </fieldType>

The above field type has been working well for our US and other English-language clients.
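
For reference, a field using this type is declared in schema.xml along these lines (the field name below is just a placeholder for illustration, not our actual schema):

  <!-- example only: "product_name" is a placeholder field name -->
  <field name="product_name" type="text_ngram" indexed="true" stored="true"/>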

Now we have some new Chinese and Japanese clients, so I googled for the best approach to a multilingual index, e.g.:

http://www.basistech.com/indexing-strategies-for-multilingual-search-with-solr-and-rosette/

https://docs.lucidworks.com/display/lweug/Multilingual+Indexing+and+Search

There seem to be pros and cons associated with every approach.

Then I did some R&D with a single-field approach, and here is my new field type:

      <fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">

      <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
      </analyzer>
      <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
      <filter class="solr.CustomNGramFilterFactory" minGramSize="3" maxGramSize="30" preserveOriginal="true"/>
      </analyzer>
      </fieldType>

I have kept the same tokenizer and only changed the filters. It is working well for all existing search use cases on English documents, as well as the new use cases for Chinese/Japanese documents.
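
For anyone who wants to compare how the two chains tokenize sample text, the token output can be inspected with Solr's field analysis handler, which the example solrconfig.xml shipped with Solr registers like this:

  <!-- solrconfig.xml: field analysis request handler (part of the default example config) -->
  <requestHandler name="/analysis/field"
                  class="solr.FieldAnalysisRequestHandler"
                  startup="lazy"/>

A request such as
http://localhost:8983/solr/collection1/analysis/field?analysis.fieldtype=text_multi&analysis.fieldvalue=Tokyo2015&analysis.query=tokyo
(host, port and core name are the 4.x defaults; adjust to your setup) returns the token streams produced by the index-time and query-time analyzers for the given value, which makes it easy to see exactly which n-grams and bigrams each field type emits.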

Now I have the following questions for the Solr experts/developers:

1) Is this the correct approach, or am I missing something?

2) Can you give me an example where this new field type will cause problems? A use case/scenario with an example would be very helpful.

3) Will there be any problems in the future as new clients with other languages come on board?

      Please provide some guidance.

People

• Assignee: Unassigned
• Reporter: Kuntal Ganguly (kuntalganguly)
• Votes: 0
• Watchers: 2

Time Tracking

• Original Estimate: 48h
• Remaining Estimate: 48h
• Time Spent: Not Specified