SOLR-3245

Poor performance of Hunspell with Polish Dictionary

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.8, 5.0
    • Component/s: Schema and Analysis
    • Labels:
    • Environment:

      CentOS 6.2, kernel 2.6.32, 2 physical Xeon 5606 CPUs (4 cores each), 32 GB RAM, 2 SSD disks in RAID 0, Java version 1.6.0_26, Java settings -server -Xms4096M -Xmx4096M

      Description

      In Solr 4.0, the Hunspell stemmer with a Polish dictionary performs poorly, whereas Hunspell from http://code.google.com/p/lucene-hunspell/ in Solr 3.4 performs very well.

      Tests show:

      Solr 3.4, full import 489017 documents:

      StempelPolishStemFilterFactory - 2908 seconds, 168 docs/sec
      HunspellStemFilterFactory - 3922 seconds, 125 docs/sec

      Solr 4.0, full import 489017 documents:

      StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
      HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
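
To make the size of the regression explicit, the figures above can be turned into throughput and slowdown ratios. This is only a small arithmetic sketch: the numbers are copied from the report, while the names `DOCS`, `import_seconds`, `docs_per_second`, and `slowdown` are introduced here for illustration.

```python
# Figures copied from the report: (Solr version, stem filter) -> full import time in seconds.
DOCS = 489017  # documents per full import

import_seconds = {
    ("3.4", "StempelPolishStemFilterFactory"): 2908,
    ("3.4", "HunspellStemFilterFactory"): 3922,
    ("4.0", "StempelPolishStemFilterFactory"): 3016,
    ("4.0", "HunspellStemFilterFactory"): 44580,
}


def docs_per_second(seconds):
    """Average indexing throughput for one full import."""
    return DOCS / seconds


def slowdown(filter_name):
    """Ratio of the Solr 4.0 import time to the Solr 3.4 import time."""
    return import_seconds[("4.0", filter_name)] / import_seconds[("3.4", filter_name)]


if __name__ == "__main__":
    for (version, name), seconds in import_seconds.items():
        print(f"Solr {version} {name}: {docs_per_second(seconds):.0f} docs/sec")
    for name in ("StempelPolishStemFilterFactory", "HunspellStemFilterFactory"):
        print(f"{name}: {slowdown(name):.2f}x the 3.4 import time")
```

With these inputs, the Stempel import time is nearly unchanged between versions (about 1.04x), while the Hunspell import takes roughly 11.4 times as long in 4.0 as in 3.4.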

      My schema is quite simple. For Hunspell I have one text field into which 14 text fields are copied:

      <field name="text" type="text_pl_hunspell" indexed="true" stored="false" multiValued="true"/>
      
      <copyField source="field1" dest="text"/>  
      ....
      <copyField source="field14" dest="text"/>
      

      The "text_pl_hunspell" configuration:

      <fieldType name="text_pl_hunspell" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.StopFilterFactory"
                      ignoreCase="true"
                      words="dict/stopwords_pl.txt"
                      enablePositionIncrements="true"
                      />
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
              <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords_pl.txt"/-->
            </analyzer>
            <analyzer type="query">
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
              <filter class="solr.StopFilterFactory"
                      ignoreCase="true"
                      words="dict/stopwords_pl.txt"
                      enablePositionIncrements="true"
                      />
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
              <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
            </analyzer>
          </fieldType>
      

      I use a Polish dictionary (pl_PL.dic, pl_PL.aff); the files stopwords_pl.txt, protwords_pl.txt, and synonyms_pl.txt are empty. These are the same files I used with version 3.4.

      For the Stempel Polish stemmer, the only difference is in the definition of the text field:

      <field name="text" type="text_pl" indexed="true" stored="false" multiValued="true"/>
      
          <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.StopFilterFactory"
                      ignoreCase="true"
                      words="dict/stopwords_pl.txt"
                      enablePositionIncrements="true"
                      />
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.StempelPolishStemFilterFactory"/>
              <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
            </analyzer>
            <analyzer type="query">
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
              <filter class="solr.StopFilterFactory"
                      ignoreCase="true"
                      words="dict/stopwords_pl.txt"
                      enablePositionIncrements="true"
                      />
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.StempelPolishStemFilterFactory"/>
              <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
            </analyzer>
          </fieldType>
      

      One document has 23 fields:

      • 14 text fields copied to one text field (above) that is only indexed
      • 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)

      The size of one document is 3-4 kB.

      Attachment: pl_PL.zip (1.07 MB, Agnieszka)

          People

          • Assignee: Unassigned
          • Reporter: Agnieszka
          • Votes: 5
          • Watchers: 6
