Solr / SOLR-3245

Poor performance of Hunspell with Polish Dictionary

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.8, 5.0
    • Component/s: Schema and Analysis
    • Labels:
    • Environment:

      CentOS 6.2, kernel 2.6.32, 2 physical Xeon 5606 CPUs (4 cores each), 32 GB RAM, 2 SSD disks in RAID 0, Java version 1.6.0_26, Java settings -server -Xms4096M -Xmx4096M

      Description

      In Solr 4.0, the Hunspell stemmer with the Polish dictionary performs poorly, whereas Hunspell from http://code.google.com/p/lucene-hunspell/ in Solr 3.4 performs very well.

      Tests show:

      Solr 3.4, full import 489017 documents:

      StempelPolishStemFilterFactory - 2908 seconds, 168 docs/sec
      HunspellStemFilterFactory - 3922 seconds, 125 docs/sec

      Solr 4.0, full import 489017 documents:

      StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
      HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec
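
As a quick check on the figures above, the throughput and the slowdown can be recomputed from the reported document count and run times. This is only a sketch in Python; the variable and function names are illustrative and not part of the original report.

```python
# Figures copied from the report above: run time in seconds per configuration.
DOCS = 489017  # documents in each full import

runs = {
    "3.4 Stempel": 2908,
    "3.4 Hunspell": 3922,
    "4.0 Stempel": 3016,
    "4.0 Hunspell": 44580,
}


def docs_per_sec(seconds, docs=DOCS):
    """Average indexing throughput for one full import."""
    return docs / seconds


# Rounded throughput for each run, comparable to the docs/sec values reported.
throughput = {name: round(docs_per_sec(s)) for name, s in runs.items()}

# How much longer the Hunspell import takes in 4.0 than in 3.4.
hunspell_slowdown = runs["4.0 Hunspell"] / runs["3.4 Hunspell"]

# The same ratio for Stempel, as a baseline for the version change alone.
stempel_slowdown = runs["4.0 Stempel"] / runs["3.4 Stempel"]
```

Comparing `hunspell_slowdown` with `stempel_slowdown` separates the Hunspell regression from any general slowdown between the two Solr versions.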

      My schema is quite simple. For Hunspell I have one text field into which I copy 14 text fields:

      <field name="text" type="text_pl_hunspell" indexed="true" stored="false" multiValued="true"/>
      
      <copyField source="field1" dest="text"/>  
      ....
      <copyField source="field14" dest="text"/>
      

      The "text_pl_hunspell" configuration:

      <fieldType name="text_pl_hunspell" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.StopFilterFactory"
                      ignoreCase="true"
                      words="dict/stopwords_pl.txt"
                      enablePositionIncrements="true"
                      />
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
              <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords_pl.txt"/-->
            </analyzer>
            <analyzer type="query">
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
              <filter class="solr.StopFilterFactory"
                      ignoreCase="true"
                      words="dict/stopwords_pl.txt"
                      enablePositionIncrements="true"
                      />
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.HunspellStemFilterFactory" dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
              <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
            </analyzer>
          </fieldType>
      

      I use the Polish dictionary files pl_PL.dic and pl_PL.aff (the files stopwords_pl.txt, protwords_pl.txt and synonyms_pl.txt are empty). These are the same files I used with version 3.4.

      For the Stempel Polish stemmer, the only difference is in the definition of the text field:

      <field name="text" type="text_pl" indexed="true" stored="false" multiValued="true"/>
      
          <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.StopFilterFactory"
                      ignoreCase="true"
                      words="dict/stopwords_pl.txt"
                      enablePositionIncrements="true"
                      />
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.StempelPolishStemFilterFactory"/>
              <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
            </analyzer>
            <analyzer type="query">
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.SynonymFilterFactory" synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
              <filter class="solr.StopFilterFactory"
                      ignoreCase="true"
                      words="dict/stopwords_pl.txt"
                      enablePositionIncrements="true"
                      />
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.StempelPolishStemFilterFactory"/>
              <filter class="solr.KeywordMarkerFilterFactory" protected="dict/protwords_pl.txt"/>
            </analyzer>
          </fieldType>
      

      One document has 23 fields:

      • 14 text fields copied into one text field (above) that is only indexed
      • 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)

      The size of one document is 3-4 kB.

      1. pl_PL.zip
        1.07 MB
        Agnieszka

        Activity

        Uwe Schindler added a comment -

        Close issue after release of 4.8.0

        Robert Muir added a comment -

        I've been fixing several bugs in this thing recently for the 4.8 release. I don't know what bug was happening here, but I am guessing it mostly involved correctness issues (LUCENE-5483) resulting in bad stems, too, which will cause crazy search results.

        I compared the performance of the 4.7 release with the current code in branch_4x (to be 4.8). For the corpus I used the first 10k news snippets from the Polish corpus here: http://www.corpora.heliohost.org/

        Version | Indexing speed (docs/second) | Number of tokens (sumTotalTermFreq) | RAM usage
        4.7     | 71.1                         | 635117                              | 50.9 MB
        4.8     | 909.3                        | 456499                              | 2 MB

        So I think the performance issues are fixed. As you can see, this Polish dictionary was definitely affected by correctness issues, and this over-recursion no longer happens.
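
To put the 4.7 and 4.8 measurements above in relative terms, the ratios can be worked out directly. This is a small illustrative sketch in Python; the dictionary keys are made-up names, and only the numbers come from the comment.

```python
# Measurements quoted in the comment above (10k Polish news snippets).
v47 = {"docs_per_sec": 71.1, "tokens": 635117, "ram_mb": 50.9}
v48 = {"docs_per_sec": 909.3, "tokens": 456499, "ram_mb": 2.0}

# Indexing speed-up from 4.7 to 4.8.
speedup = v48["docs_per_sec"] / v47["docs_per_sec"]

# Share of tokens that 4.7 emitted and 4.8 no longer does.
token_drop_pct = (1 - v48["tokens"] / v47["tokens"]) * 100

# Reduction factor in RAM usage.
ram_factor = v47["ram_mb"] / v48["ram_mb"]
```

The drop in the token count is the figure that points at the correctness fixes: the same input produces noticeably fewer stems in 4.8.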

        Romain MERESSE added a comment -

        Any update on this issue? This problem is still present in 4.2

        Romain MERESSE added a comment - edited

        Same problem here, with the French dictionary in Solr 3.6:

        With Hunspell: ~5 documents/s
        Without Hunspell: ~280 documents/s

        Has anyone found a solution?
        It is quite sad, as this is a very important feature (stemming is poor with Snowball).

        Ales Perme added a comment -

        An update: I downloaded the latest dictionaries from http://extensions.services.openoffice.org/dictionary and unpacked the oxt (it is actually a zip file), took out the .dic and .aff files and got better speeds: 55 docs/sec and 90 docs/sec if I disable WordDelimiterFilterFactory. Stemming is more important than WordDelimiterFilterFactory for me. I hope this helps in any way.
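
Since an .oxt extension is a zip archive, the extraction step described above can be scripted. The following is a minimal sketch in Python using only the standard library; the function name `extract_hunspell_files` and the paths in the usage note are hypothetical, not something from this issue.

```python
import zipfile
from pathlib import Path


def extract_hunspell_files(oxt_path, dest_dir):
    """Copy every .dic and .aff member of an .oxt (zip) archive into dest_dir.

    Returns the sorted list of file names that were written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(oxt_path) as archive:
        for member in archive.namelist():
            if member.lower().endswith((".dic", ".aff")):
                # Drop any sub-directory used inside the archive.
                target = dest / Path(member).name
                target.write_bytes(archive.read(member))
                extracted.append(target.name)
    return sorted(extracted)
```

A call such as `extract_hunspell_files("dictionary.oxt", "conf/dictionaries")` (both paths are placeholders) would leave the .dic and .aff files where the `dictionary` and `affix` attributes of HunspellStemFilterFactory can point to them.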

        Ales Perme added a comment -

        Hi! I have the same problem with the Slovenian dictionary in Solr version 3.6. Performance comparisons:

        Solr 3.1 + Hunspell: indexing speed 285 documents/s
        Solr 3.6 + Hunspell: indexing speed 23 documents/s
        Solr 3.6 without Hunspell: indexing speed 110 documents/s

        Weird...

        SCHEMA:
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" stemEnglishPossessive="0" preserveOriginal="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
            <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
            <filter class="solr.HunspellStemFilterFactory" dictionary="dictionaries/sl_SI.dic" affix="dictionaries/sl_SI.aff" ignoreCase="true"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" stemEnglishPossessive="0" preserveOriginal="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
            <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
            <filter class="solr.HunspellStemFilterFactory" dictionary="dictionaries/sl_SI.dic" affix="dictionaries/sl_SI.aff" ignoreCase="true"/>
          </analyzer>
        </fieldType>

        Agnieszka added a comment -

        I ran one more test for Hunspell with an English dictionary (from OpenOffice.org) in Solr 4.0. It seems that the problem does not occur with the English dictionary.

        Solr 4.0, full import 489017 documents, hunspell, english dictionary:

        3146 seconds, 155 docs/sec

        But I'm not sure whether this result is reliable, because I used documents with Polish text to test the English dictionary.

        Agnieszka added a comment -

        Polish dictionary for Hunspell


          People

          • Assignee: Unassigned
          • Reporter: Agnieszka
          • Votes: 5
          • Watchers: 6
